Breach data, dark web intelligence and OSINT tooling

Breach data sources, dark web search, OSINT frameworks, automation tools (theHarvester, recon-ng, Maltego, SpiderFoot, Amass) and target-profile methodology.

Breach data

Billions of credentials from thousands of data breaches are publicly indexed. Checking whether a target's email addresses or domains appear in breach data yields:

Confirmed email address formats and specific addresses in use
Potentially usable passwords for credential-stuffing or password-spray attacks against current services
Associated usernames across platforms
Personal data (names, addresses, security-question answers) that enables social engineering
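
As a rough illustration of how breach records confirm an address format, the sketch below compares known name and address pairs against a few common local-part patterns and reports the most frequent match. The pattern names, the helper functions and the sample records are all assumptions made for this example, not output from any breach service.

```python
from collections import Counter

# Hypothetical pattern table: each entry builds a local part from (first, last).
PATTERNS = {
    "first.last": lambda f, l: f"{f}.{l}",
    "flast": lambda f, l: f"{f[0]}{l}",
    "firstl": lambda f, l: f"{f}{l[0]}",
    "firstlast": lambda f, l: f"{f}{l}",
    "first": lambda f, l: f,
}


def classify(first, last, email):
    """Return the name of the first pattern that reproduces the local part, or None."""
    if not first or not last or "@" not in email:
        return None
    local = email.split("@", 1)[0].lower()
    first, last = first.lower(), last.lower()
    for name, build in PATTERNS.items():
        if build(first, last) == local:
            return name
    return None


def dominant_format(records):
    """Return the most common pattern across (first, last, email) records, or None."""
    counts = Counter()
    for first, last, email in records:
        pattern = classify(first, last, email)
        if pattern is not None:
            counts[pattern] += 1
    return counts.most_common(1)[0][0] if counts else None


# Invented sample data on a reserved example domain.
sample = [
    ("Ana", "Diaz", "ana.diaz@example.com"),
    ("Bo", "Chen", "bo.chen@example.com"),
    ("Cy", "Ng", "cng@example.com"),
]
print(dominant_format(sample))
```

A single matching address is weak evidence; the format is only worth relying on once several independent records agree.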

Have I Been Pwned

HIBP (haveibeenpwned.com), maintained by Troy Hunt, is the standard first check for email address exposure. It indexes billions of records from hundreds of breaches, with metadata for each breach:

Email search: shows every breach an address appears in, with breach date and data types exposed
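Domain search: the most useful feature for corporate investigations. After verifying control of a domain, you get a report of every address on that domain that appears in breaches, without querying each address individually. Because of the verification step, the search has to be run by the domain owner or with their cooperation, for example on an authorised engagement.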
Pwned Passwords (haveibeenpwned.com/Passwords): checks a password hash against 750M+ breached passwords using k-anonymity (only the first 5 characters of the SHA-1 hash are transmitted; the full password never leaves your machine). Integrate into login flows to block known-breached passwords.
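API: programmatic access for integration into monitoring tools and custom scripts. The Pwned Passwords range API is free and needs no key; breach lookups for email addresses require a paid API key.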
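
The k-anonymity exchange used by Pwned Passwords can be sketched as follows: hash locally, send only the first five hex characters, and compare the returned suffixes on your own machine. The function names and the User-Agent string are assumptions for illustration; the range endpoint is the public api.pwnedpasswords.com one, and the network call is kept in its own function so the hashing and parsing logic can be checked offline.

```python
import hashlib
import urllib.request
from typing import Tuple

RANGE_URL = "https://api.pwnedpasswords.com/range/{prefix}"


def split_sha1(password: str) -> Tuple[str, str]:
    """Return (5-character prefix, 35-character suffix) of the uppercase SHA-1 hex digest."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def count_from_range(suffix: str, body: str) -> int:
    """Find our suffix in a range response made of SUFFIX:COUNT lines; 0 if absent."""
    for line in body.splitlines():
        candidate, _, count = line.strip().partition(":")
        if candidate.upper() == suffix:
            return int(count)
    return 0


def pwned_count(password: str) -> int:
    """Query the range API. Only the 5-character prefix leaves the machine."""
    prefix, suffix = split_sha1(password)
    request = urllib.request.Request(
        RANGE_URL.format(prefix=prefix),
        headers={"User-Agent": "osint-notes-example"},  # hypothetical client name
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        body = response.read().decode("utf-8")
    return count_from_range(suffix, body)
```

In a login flow, a non-zero result from `pwned_count` would be the signal to reject the candidate password and ask for another.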

DeHashed

DeHashed (dehashed.com) is a subscription-based breach search engine that returns the actual breach data, not merely a flag that a breach occurred. Search by email, username, IP address, name, phone number, VIN or domain, and retrieve the full record including password hashes or plaintext passwords where available. DeHashed is the tool of choice for understanding the full credential exposure for a target, which makes it invaluable for demonstrating real impact on an authorised pentest.

Intelligence X

Intelligence X (intelx.io) indexes dark web sites, paste sites (Pastebin and many alternatives), leaked breach databases and other sources, and searches them simultaneously in one query. It excels at:

Paste site mentions not indexed by Google or other tools
Dark web references to a target's domain, executive names or IP ranges
Older breach data not yet included in other databases
Historical records of URLs and domains that have since been taken down

A limited free tier exists; the full API and historical depth require a subscription.

Other breach sources

Snusbase (snusbase.com): large breach database with API access, particularly strong on older breaches
LeakCheck (leakcheck.io): breach search with a free tier and programmatic API
Breachbase and underground forums: the freshest breach data first appears on criminal forums before it's indexed by HIBP or DeHashed. Monitoring (read-only; never participate) requires Tor Browser access. Intelligence X's dark web indexing partly automates this.

Dark web intelligence

The Tor network routes traffic through a circuit of relays (typically three), providing anonymity for both clients and onion services (.onion addresses, formerly called hidden services). Dark web OSINT (monitoring .onion sites for threat intelligence) is legitimate for threat analysts, security researchers and journalists.

Dark web search engines

Accessible via Tor Browser (torproject.org):

Ahmia (ahmia.fi): the most reliable Tor search engine, indexing .onion sites while filtering known CSAM. Also accessible on the clearnet, making it useful without needing Tor for initial searches.
Torch: one of the oldest .onion search engines; large raw index with no filtering.
DarkSearch (darksearch.io): clearnet interface that indexes dark web content with a free API, useful for automated monitoring.
Onion Search Engine (onionsearchengine.com): aggregates results from multiple .onion search engines.

OPSEC for dark web access

Always use Tor Browser (not just the Tor daemon in another browser). Do not resize the browser window (window dimensions are a fingerprinting vector). Do not log into any clearnet account from within Tor Browser. Use a dedicated VM and snapshot it before each session so you can revert cleanly. Avoid downloading files via Tor; if you must, treat executables as hostile and open them only in an isolated environment.

Threat intelligence platforms

Commercial platforms provide automated dark web monitoring for enterprise customers:

Recorded Future (recordedfuture.com): real-time threat intelligence with dark web, paste and forum monitoring
Flashpoint (flashpoint.io): criminal forum monitoring with expert analysis
Digital Shadows / ReliaQuest: digital risk monitoring including exposed credentials and dark web mentions

OSINT frameworks and automation

OSINT Framework

osintframework.com is a free visual directory of OSINT tools, organised by category: username, email address, domain name, IP address, social networks, geolocation, dark web, and more. It is not an active tool; it is a map of available resources. Use it when you need to find what tools exist for a specific intelligence type you haven't encountered before.

theHarvester

theHarvester is one of the most versatile automated collection tools. It queries multiple public sources simultaneously for a target domain and returns email addresses, names, subdomains, IP addresses and virtual hosts:

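# Source names and API-key requirements vary by release; run theHarvester -h to list what your version supports
theHarvester -d example.com -b all                     # all available sources
theHarvester -d example.com -b bing,crtsh,hunter -s    # selected sources; -s also queries Shodan for discovered hosts
theHarvester -d example.com -b certspotter             # certificate transparency only
theHarvester -d example.com -b bing -l 500 -f report   # cap results at 500 and save a report file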

A good automated first step for any domain-based investigation.
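
Collection tools tend to return overlapping and noisy results, so a small post-processing step helps. The sketch below pulls email addresses out of arbitrary text, lowercases and de-duplicates them, and keeps only those on the target domain or its subdomains. The sample text is invented and is not meant to mirror theHarvester's actual output format; the regular expression is a pragmatic approximation, not a full address parser.

```python
import re

# Pragmatic pattern for addresses embedded in free text (not RFC-complete).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def emails_for_domain(text, domain):
    """Return sorted, de-duplicated addresses on `domain` or any of its subdomains."""
    domain = domain.lower()
    found = set()
    for match in EMAIL_RE.findall(text):
        address = match.lower()
        host = address.rsplit("@", 1)[1]
        if host == domain or host.endswith("." + domain):
            found.add(address)
    return sorted(found)


# Invented sample text using reserved example domains.
raw = (
    "Contact: Jane.Doe@example.com, jane.doe@example.com; "
    "ops@mail.example.com. Unrelated: spam@notexample.com"
)
print(emails_for_domain(raw, "example.com"))
```

The suffix check uses a leading dot so that look-alike domains such as notexample.com are not treated as in scope.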

recon-ng

recon-ng is a modular reconnaissance framework modelled on Metasploit's architecture. It maintains a persistent workspace database (hosts, contacts, credentials, locations) and provides installable modules for many major OSINT data sources:

recon-ng
[recon-ng] > workspaces create acmecorp
[recon-ng] > marketplace install recon/domains-hosts/bing_domain_web
[recon-ng] > modules load recon/domains-hosts/bing_domain_web
[recon-ng] > options set SOURCE example.com
[recon-ng] > run
[recon-ng] > show hosts

The workspace accumulates all findings across modules, and the structured output is suitable for documentation. recon-ng has modules for WHOIS lookups, Shodan queries, LinkedIn scraping, GitHub enumeration, breach data and geolocation.

Maltego

Maltego (commercial with a free Community Edition) is a graphical link-analysis platform. Rather than producing lists, it visualises relationships between entities (domains, people, email addresses, IP addresses, social media accounts, organisations) as a graph. This makes it powerful for mapping complex relationships and finding non-obvious connections between data points.

Maltego uses transforms (API calls to various data sources) to expand a starting entity: add a domain, run a transform, get IPs; run another, get email addresses from Hunter.io; run another, get social profiles. The Community Edition is sufficient for educational use but is rate-limited.
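
The transform idea can be illustrated with a toy graph expansion. This is not Maltego's API: the entity tuples, the transform functions and the lookup tables are all hypothetical stand-ins, and the data uses reserved documentation addresses. The point is only to show how repeatedly applying transforms to new entities surfaces links (here, a second domain sharing an IP) that a flat list would hide.

```python
from collections import deque

# Invented lookup tables standing in for real data sources.
FAKE_DNS = {"example.com": ["192.0.2.10"]}
FAKE_REVERSE = {"192.0.2.10": ["example.com", "example.org"]}
FAKE_EMAILS = {"example.com": ["admin@example.com"]}


def domain_to_ips(value):
    return [("ip", ip) for ip in FAKE_DNS.get(value, [])]


def domain_to_emails(value):
    return [("email", address) for address in FAKE_EMAILS.get(value, [])]


def ip_to_domains(value):
    return [("domain", name) for name in FAKE_REVERSE.get(value, [])]


# Which transforms apply to which entity type.
TRANSFORMS = {
    "domain": [domain_to_ips, domain_to_emails],
    "ip": [ip_to_domains],
}


def expand(seed, max_depth=2):
    """Breadth-first expansion from a seed entity; returns a set of directed edges."""
    edges = set()
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        entity, depth = queue.popleft()
        if depth >= max_depth:
            continue  # entities at the depth limit are recorded but not expanded
        kind, value = entity
        for transform in TRANSFORMS.get(kind, []):
            for neighbour in transform(value):
                edges.add((entity, neighbour))
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append((neighbour, depth + 1))
    return edges


for source, target in sorted(expand(("domain", "example.com"))):
    print(source, "->", target)
```

The depth limit matters in practice: unrestricted expansion through shared hosting or large mail providers quickly pulls in entities unrelated to the target.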

SpiderFoot

SpiderFoot (spiderfoot.net) is an open-source automated OSINT tool that queries over 200 data sources for a target domain, IP address, email address, username or subnet. It provides both a CLI and a web UI:

spiderfoot -s example.com -u all -o json > spiderfoot-report.json   # -u selects modules by use case; -o sets the output format (tab, csv or json)

Output includes DNS and subdomain findings, open ports (via Shodan), email addresses, social media profiles, breach data and threat intelligence feed hits. SpiderFoot HX is the commercial hosted version with additional modules and scheduled monitoring.

Amass

Amass (OWASP) is the de facto standard for deep subdomain and attack surface enumeration. It combines active DNS enumeration with passive sources (certificate transparency, Shodan, Censys, SecurityTrails, BGP data, and more) to produce a comprehensive attack surface map:

amass enum -d example.com -passive                    # passive only, no active DNS
amass enum -d example.com -active -brute              # active with brute-force
amass db -names -d example.com                         # query results from local db
amass viz -d3 -d example.com                          # visualise as interactive D3 graph

Amass also resolves discovered subdomains, tests for wildcard DNS, and feeds results into a local database for incremental runs.
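
Incremental runs are most useful when you can see what changed. As a tool-agnostic sketch (it does not read Amass's database), the code below normalises two hostname lists, such as exports from a previous and a current enumeration, and reports in-scope names that are new. The function names and sample hostnames are assumptions for illustration.

```python
def normalise(name):
    """Lowercase, trim whitespace and a trailing dot, and drop a leading wildcard label."""
    name = name.strip().lower().rstrip(".")
    if name.startswith("*."):
        name = name[2:]
    return name


def in_scope(name, root):
    """True if `name` is the root domain or one of its subdomains."""
    return name == root or name.endswith("." + root)


def new_hosts(previous, current, root):
    """Return sorted in-scope hostnames present in `current` but not in `previous`."""
    root = normalise(root)
    before = {normalise(n) for n in previous if n.strip()}
    after = {normalise(n) for n in current if n.strip()}
    return sorted(n for n in after - before if in_scope(n, root))


# Invented sample data on reserved example domains.
last_run = ["www.example.com", "mail.example.com."]
this_run = ["WWW.example.com", "vpn.example.com", "cdn.example.net", "*.dev.example.com", ""]
print(new_hosts(last_run, this_run, "example.com"))
```

Newly appearing names are often the most interesting part of a repeat scan, since they tend to be recently deployed and less hardened.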

Additional tools

Photon (github.com/s0md3v/Photon): crawls a website and extracts URLs, email addresses, social media accounts, files and endpoints. An OSINT-focused web crawler.
Maigret (github.com/soxoj/maigret): username-to-profile tool that also extracts linked accounts and builds a social graph
Reconspider and Sublist3r: subdomain enumeration alternatives
Metagoofil and ExifTool: document metadata extraction (see Search and Web chapter)

Building a target profile: the workflow

A structured methodology brings these tools together without duplicating effort:

Seed: begin with a domain name, organisation name, or known email address
Infrastructure: WHOIS → name servers → crt.sh (subdomains) → Shodan/Censys (exposed services) → bgp.he.net (IP ranges)
People: Hunter.io (email format and addresses) → theHarvester (bulk collection) → LinkedIn (employee enumeration) → Holehe (email-to-accounts mapping)
History and exposure: Wayback Machine (deleted content) → GitHub search (secrets in code) → breach databases (HIBP domain search, DeHashed, IntelX)
Geolocation: ExifTool on any collected documents and images, WiGLE for network infrastructure, satellite imagery for facility layout
Synthesis: assemble findings in a link-analysis tool or diagram. Identify the highest-value targets: exposed credentials, forgotten subdomains with old software, high-privilege email addresses, services running on non-standard ports
Documentation: record every finding with source, exact query and timestamp. OSINT is a snapshot; data changes and provenance matters
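
The documentation step can be made routine with a small record type. The sketch below is one possible shape, not a standard: the field names are assumptions, and the timestamp defaults to the current UTC time so that every finding carries its provenance from the moment it is logged. Findings are serialised one JSON object per line so the log can be appended to and diffed.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def utc_now():
    """Current time as an ISO 8601 string with an explicit UTC offset."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class Finding:
    value: str    # what was found, e.g. a hostname or address
    source: str   # where it came from, e.g. "crt.sh"
    query: str    # the exact query that produced it
    collected_at: str = field(default_factory=utc_now)


def to_jsonl(findings):
    """Serialise findings as JSON Lines (one object per line, keys sorted)."""
    return "\n".join(json.dumps(asdict(f), sort_keys=True) for f in findings)


# Invented example entry.
log = [Finding("vpn.example.com", "crt.sh", "%.example.com")]
print(to_jsonl(log))
```

Storing the exact query alongside the result is what lets someone else, or you six months later, re-run the lookup and see how the data has changed.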

Attribution caveats

OSINT findings establish correlation, not proof. An IP address that scanned your network and resolves to a VPS at a particular hosting provider tells you where the traffic came from, not who controlled it. A username match across platforms is a strong lead, not confirmed identity. Anyone can host from a VPS in Country X; anyone can register an email and reuse a username.

Report what you can observe directly and be explicit about the difference between observation, inference and speculation. State confidence levels. In any legal or professional context, overstating certainty from OSINT is as damaging as missing the finding entirely.
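
One way to keep observation, inference and speculation apart in a report is to make the label part of the data. The sketch below is an assumed structure rather than an established reporting standard; the rule that speculation can never carry high confidence is likewise a choice made for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Basis(Enum):
    OBSERVATION = "observation"   # directly seen in the data
    INFERENCE = "inference"       # reasoned from observations
    SPECULATION = "speculation"   # plausible but unsupported


class Confidence(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class Claim:
    statement: str
    basis: Basis
    confidence: Confidence

    def __post_init__(self):
        # Assumed house rule: speculation is never reported with high confidence.
        if self.basis is Basis.SPECULATION and self.confidence is Confidence.HIGH:
            raise ValueError("speculation cannot be stated with high confidence")


def render(claim):
    """Format a claim with its basis and confidence stated up front."""
    return f"[{claim.basis.value}, {claim.confidence.value} confidence] {claim.statement}"


print(render(Claim("Scan traffic originated from 192.0.2.10", Basis.OBSERVATION, Confidence.HIGH)))
print(render(Claim("The operator rents the VPS behind 192.0.2.10", Basis.INFERENCE, Confidence.LOW)))
```

Forcing every statement through a structure like this makes it harder to let an inference drift into the report worded as a fact.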

Quick recall

HIBP: email breach check (free) and domain search (returns all breached employee addresses). Pwned Passwords API uses k-anonymity. DeHashed: full breach record with credentials (paid). IntelX: searches pastes + dark web + breaches together.
Tor search: Ahmia.fi is most reliable and is also on the clearnet. Use Tor Browser with OPSEC (dedicated VM, no window resize, no clearnet logins).
OSINT Framework (osintframework.com): tool directory by category; use when you need to find what exists for a specific intelligence type.
theHarvester: multi-source email/host collection in one command. recon-ng: modular, persistent workspace. Maltego: graphical link analysis. SpiderFoot: 200-source automated scan. Amass: deep subdomain enumeration.
Workflow: infrastructure → people → history/exposure → geolocation → synthesis. Document every step with source and timestamp.
Attribution: correlation ≠ proof. State confidence levels; distinguish observation from inference.