Search engine dorking and web archives

Google and Bing dork operators, the Wayback Machine, paste sites, GitHub secret scanning and document metadata extraction.

Search engine operators (dorking)

Web search engines support advanced operators that narrow results to specific sites, file types, page sections and content patterns. When used deliberately to find sensitive or exposed information, this is called Google dorking or search engine dorking. The technique requires no tools beyond a browser and no interaction with any target system, but it can surface exposed admin panels, configuration files, credential leaks, directory listings and version-disclosing error messages.

Google operator reference

Operator	Example	What it finds
`site:`	`site:example.com`	All indexed pages on a domain
`site:` with path	`site:example.com/admin`	Pages under a specific path
`filetype:`	`filetype:pdf site:example.com`	Specific file extensions
`intitle:`	`intitle:"index of"`	Pages with text in the `<title>`
`inurl:`	`inurl:admin`	URLs containing a string
`intext:`	`intext:"password"`	Pages with text in the body
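`cache:`	`cache:example.com`	Google's cached copy of a page (Google retired this operator in 2024; use the Wayback Machine instead)
`-` (minus)	`site:example.com -www`	Exclude results matching a term, here to surface non-www subdomains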
`"quotes"`	`"database error"`	Exact phrase match
`OR`	`site:example.com OR site:example.net`	Either term
`*`	`"admin * password"`	Wildcard for any word
`before:` / `after:`	`after:2024-01-01 site:example.com`	Date-bounded results

High-value dork patterns

These combinations regularly surface sensitive information in CTFs and real-world assessments:

site:example.com filetype:pdf                         # all PDFs (check for metadata)
site:example.com filetype:xml OR filetype:conf OR filetype:log  # XML, config and log files
intitle:"index of" site:example.com                   # open directory listings
inurl:phpinfo.php site:example.com                    # PHP info pages (leaks server config)
site:example.com inurl:".git"                         # exposed .git directories
site:example.com intext:"sql syntax near"             # SQL error messages
"BEGIN RSA PRIVATE KEY" site:github.com               # private keys in public repos
site:pastebin.com "example.com"                       # pastes referencing the target
inurl:wp-admin site:example.com                       # WordPress admin interfaces
site:example.com intext:"Index of /" intitle:"Index of"  # stricter directory-listing match
"X-Forwarded-For" filetype:log                        # log files leaking headers
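The patterns above are easy to regenerate for each new target. The following is a minimal Python sketch, not a standard tool: the template list is a hand-picked subset of the dorks shown, and the names DORK_TEMPLATES, build_dorks and search_url are invented for this illustration. It only builds search URLs to open in a browser; it sends no queries itself, since search engines tend to challenge automated requests.

```python
from urllib.parse import quote_plus

# Hypothetical template set: a subset of the patterns listed above,
# with {domain} as a placeholder for the target.
DORK_TEMPLATES = [
    'site:{domain} filetype:pdf',
    'intitle:"index of" site:{domain}',
    'inurl:phpinfo.php site:{domain}',
    'site:{domain} intext:"sql syntax near"',
    'site:pastebin.com "{domain}"',
]


def build_dorks(domain: str) -> list[str]:
    """Fill each template with the target domain."""
    return [template.format(domain=domain) for template in DORK_TEMPLATES]


def search_url(query: str, engine: str = "google") -> str:
    """Return a browser-ready search URL for a dork query."""
    bases = {
        "google": "https://www.google.com/search?q=",
        "bing": "https://www.bing.com/search?q=",
    }
    return bases[engine] + quote_plus(query)


if __name__ == "__main__":
    # Print one URL per dork so they can be opened by hand.
    for dork in build_dorks("example.com"):
        print(search_url(dork))
```

Keeping the templates in one list makes it simple to run the same set against Bing by passing engine="bing".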

The Google Hacking Database (GHDB) at exploit-db.com/google-hacking-database catalogues thousands of working dorks categorised by type: files containing passwords, sensitive directories, error messages, exposed login portals. It is a reference worth bookmarking before any assessment.

Bing and DuckDuckGo

Bing supports site:, filetype: and intitle: with similar syntax and indexes different content from Google. inurl: is not among Bing's documented operators, so treat its results there as unreliable. Running the same dork pattern on Bing after Google sometimes surfaces results that Google's ranking has de-prioritised, particularly older pages.

DuckDuckGo supports site: and filetype: and does not personalise results, which is useful when you want a view unclouded by your search history.

The Wayback Machine

The Wayback Machine (web.archive.org) archives web snapshots going back to 1996. For OSINT, it provides access to content that no longer exists on the live site:

Deleted configuration files, employee directories and product pages
Previous versions of a page before a cleanup
Sites that have been taken down entirely
Old JavaScript files containing hardcoded API keys or endpoints that were removed from the live version

Enter a URL to see the calendar of archived snapshots. To browse all URLs ever captured for a domain, use the wildcard:

https://web.archive.org/web/*/example.com/*

waybackurls is a CLI tool that pulls every URL the Wayback Machine has recorded for a domain (recent versions also query Common Crawl and VirusTotal; gau is a comparable tool that adds URLScan and AlienVault OTX as sources):

waybackurls example.com | sort -u

This surfaces endpoints and paths that don't appear in the current sitemap: development routes, deprecated APIs, parameter names that reveal application structure.
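The same URL listing can be pulled straight from the Wayback Machine's CDX API without installing anything. The sketch below is an illustration under stated assumptions rather than a replacement for waybackurls: the function names and the marker list in interesting are invented here, and the query parameters (url, output, fl, collapse, limit) are the ones the CDX server documents.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(domain: str, limit: int = 1000) -> str:
    """Build a CDX query for every captured URL under a domain."""
    params = {
        "url": f"{domain}/*",   # wildcard: everything under the domain
        "output": "json",
        "fl": "original",       # return only the original URL field
        "collapse": "urlkey",   # one row per distinct URL
        "limit": str(limit),
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_rows(rows: list[list[str]]) -> list[str]:
    """Drop the header row of a JSON CDX response and deduplicate URLs."""
    return sorted({row[0] for row in rows[1:]})


def interesting(urls: list[str], markers=(".js", ".env", ".bak", "/api", "?")) -> list[str]:
    """Keep URLs containing any marker (an arbitrary, illustrative list)."""
    return [u for u in urls if any(m in u.lower() for m in markers)]


def fetch_archived_urls(domain: str, limit: int = 1000) -> list[str]:
    """Query the CDX API over the network and return the distinct URLs."""
    with urlopen(build_cdx_url(domain, limit), timeout=30) as response:
        body = response.read().decode("utf-8")
    return parse_cdx_rows(json.loads(body)) if body.strip() else []
```

Calling interesting(fetch_archived_urls("example.com")) narrows a long listing to scripts, backups, API routes and parameterised URLs, which are usually the first places to look.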

URLScan.io (urlscan.io) stores the results of URL scans submitted by security tools. Searching urlscan for a domain reveals historical scans, linked resources, JavaScript endpoints and external requests, giving another timeline of how the site has looked.

Paste sites

When credentials, source code, configuration data or API keys leak, they frequently appear on paste sites: text-hosting services where anyone can share text anonymously, often for a limited time before the paste expires or is removed.

Key sources to check and monitor:

Pastebin (pastebin.com): the largest and most searched. Monitor with: site:pastebin.com "targetcompany.com"
GitHub Gist (gist.github.com): public gists are searchable. Developers often paste credentials into "secret" gists while troubleshooting, not realising that "secret" only means unlisted: anyone with the URL can read the gist.
ControlC (controlc.com) and Hastebin: alternative services with different indexing
Intelligence X (intelx.io): searches Pastebin, dark web pastes, leaked breach databases and many other sources simultaneously; the most efficient single search for paste exposure

Automating paste monitoring: set Google Alerts for site:pastebin.com "[target]", or use PasteHunter for self-hosted automated collection.

GitHub OSINT

GitHub and similar code hosts are a major source of accidentally exposed secrets. Developers commit .env files, config files with API keys, hardcoded passwords and private keys, then push a "removal" commit without realising that the full git history remains publicly accessible.

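GitHub code search operators (filename: and extension: belong to the legacy search syntax; the current code search expresses both with path:, for example path:.env or path:*.sql):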

"example.com" password                  # credentials mentioning the domain
"BEGIN RSA PRIVATE KEY"                 # RSA private keys
"api_key" "example.com"                 # API keys with domain context
user:johndoe filename:.env              # .env files in a specific user's repos
org:acmecorp filename:config.yml        # config files in an organisation
extension:sql password                  # SQL dumps containing passwords
"DB_PASSWORD" "example.com"             # database credentials
"aws_access_key_id"                     # AWS credentials
"Authorization: Bearer"                 # hardcoded tokens

A commit that removes a file does not remove it from the git history: the secret stays retrievable from every earlier commit. In a cloned repository, list the commits and inspect any of them with:

git log --oneline --all
git show <commit-hash>
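Reading commits one by one does not scale, so the same idea can be scripted. This is a rough Python sketch, assuming git is installed and the repository is already cloned locally; the two regular expressions cover only AWS access key IDs and PEM private key headers, and the names SECRET_PATTERNS, scan_patch and scan_repo are made up for this example.

```python
import re
import subprocess

# Two well-known formats only; a real rule set would be far larger.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
}


def scan_patch(patch_text: str) -> list[tuple[str, str, str]]:
    """Return (commit, rule, line) for every added line matching a rule."""
    findings = []
    commit = "unknown"
    for line in patch_text.splitlines():
        if line.startswith("commit "):
            # Header line of `git log`: remember which commit we are in.
            commit = line.split()[1]
        elif line.startswith("+") and not line.startswith("+++"):
            # Added line in a diff (skipping the "+++ b/file" header).
            for rule, pattern in SECRET_PATTERNS.items():
                if pattern.search(line):
                    findings.append((commit, rule, line[1:].strip()))
    return findings


def scan_repo(path: str) -> list[tuple[str, str, str]]:
    """Run `git log -p --all` in a local clone and scan the full history."""
    result = subprocess.run(
        ["git", "-C", path, "log", "-p", "--all"],
        capture_output=True, text=True, check=True, errors="replace",
    )
    return scan_patch(result.stdout)
```

Because only added lines are inspected, a secret is reported at the commit that introduced it, even if a later commit deleted the file.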

Automated secret-scanning tools:

TruffleHog: scans git history for high-entropy strings and known credential patterns. Example: trufflehog git https://github.com/org/repo
Gitleaks: similar, with a comprehensive rule set for common service credential formats
GitHound: searches GitHub at scale for a target's credentials across all public repositories, not only their own
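The "high-entropy string" heuristic mentioned above can be shown in a few lines. This is a sketch of the general idea, not TruffleHog's actual implementation: it computes Shannon entropy per character and flags long tokens above a threshold. The token alphabet, the minimum length of 20 and the threshold of 4.0 bits are arbitrary choices for illustration.

```python
import math
import re
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of a string, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def high_entropy_tokens(line: str, min_length: int = 20, threshold: float = 4.0) -> list[str]:
    """Return long base64-like tokens whose entropy meets the threshold."""
    tokens = re.findall(r"[A-Za-z0-9+/=_\-]{%d,}" % min_length, line)
    return [token for token in tokens if shannon_entropy(token) >= threshold]
```

Random keys and tokens use their alphabet almost uniformly and score high, while natural-language words and repeated padding score low, which is why entropy works as a cheap first filter before pattern matching.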

Document metadata

Documents published online carry invisible metadata: the author's full name, organisation, software version used to create it, internal file path (which may reveal internal server names), and editing history. PDF, Word, Excel and PowerPoint files are the most common carriers.

ExifTool extracts metadata from most common document, image and media formats:

exiftool document.pdf              # PDF metadata: author, creator application, dates
exiftool image.jpg                 # photo metadata including GPS coordinates
exiftool -r ./documents/           # recursive extraction from an entire directory

For harvesting documents from a target website automatically:

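Metagoofil: uses Google to find and download documents from a target domain: metagoofil -d example.com -t pdf,doc,xls -l 50 -o ./results. Flags differ between versions, and recent releases only download the files (with -w) and leave metadata extraction to ExifTool, so check metagoofil -h first.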
FOCA (Windows GUI): similar, with a graphical interface that visualises metadata relationships and inferred internal infrastructure
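Once a batch of documents is downloaded, the useful step is aggregating who and what produced them. The sketch below assumes ExifTool is installed and relies on its -json output, which is a JSON array with one object per file. The tag list in FIELDS_OF_INTEREST is an illustrative selection (which tags appear depends on the file format), and the function names are invented for this example.

```python
import json
import subprocess
from collections import defaultdict

# Tag names as ExifTool commonly reports them; which ones appear
# depends on the file format and the authoring software.
FIELDS_OF_INTEREST = ("Author", "Creator", "Producer", "LastModifiedBy", "Company")


def summarise_metadata(records: list[dict]) -> dict[str, set[str]]:
    """Group distinct values per metadata field across all documents."""
    summary = defaultdict(set)
    for record in records:
        for field in FIELDS_OF_INTEREST:
            value = record.get(field)
            if value:
                summary[field].add(str(value))
    return dict(summary)


def run_exiftool(directory: str) -> list[dict]:
    """Run `exiftool -json -r` on a directory and parse its JSON output."""
    result = subprocess.run(
        ["exiftool", "-json", "-r", directory],
        capture_output=True, text=True,
    )
    # ExifTool may exit non-zero when some files fail, so parse whatever it printed.
    return json.loads(result.stdout) if result.stdout.strip() else []
```

Running summarise_metadata(run_exiftool("./results")) yields a short list of distinct author names and software versions, which is usually more informative than reading per-file output.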

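GPS coordinates in photos are common in CTF challenges. Smartphone cameras embed coordinates in EXIF whenever location tagging is enabled for the camera app, and the data survives unless the file is stripped before publication (many social platforms strip it on upload, but files shared directly usually keep it). ExifTool reveals the coordinates in seconds.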
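By default ExifTool prints coordinates in degrees, minutes and seconds, for example 51 deg 30' 26.46" N, while mapping tools expect signed decimal degrees. The following Python sketch converts between the two; the function names are illustrative, and the regular expression assumes ExifTool's default print format. Running exiftool with -n avoids the conversion entirely by printing decimal values.

```python
import re

# Matches ExifTool's default coordinate format, e.g. 51 deg 30' 26.46" N
DMS_PATTERN = re.compile(
    r"(?P<deg>\d+) deg (?P<min>\d+)' (?P<sec>[\d.]+)\" (?P<ref>[NSEW])"
)


def dms_to_decimal(dms: str) -> float:
    """Convert a degrees/minutes/seconds string to signed decimal degrees."""
    match = DMS_PATTERN.search(dms)
    if match is None:
        raise ValueError(f"unrecognised coordinate: {dms!r}")
    value = (
        int(match["deg"])
        + int(match["min"]) / 60
        + float(match["sec"]) / 3600
    )
    # South and West are negative in decimal notation.
    return -value if match["ref"] in "SW" else value


def maps_link(latitude: str, longitude: str) -> str:
    """Build a map URL from ExifTool-style latitude and longitude strings."""
    lat, lon = dms_to_decimal(latitude), dms_to_decimal(longitude)
    return f"https://www.google.com/maps?q={lat:.6f},{lon:.6f}"
```

Six decimal places correspond to roughly ten centimetres on the ground, which is more precision than a phone's GPS fix actually carries.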

Monitoring and alerting

For ongoing intelligence against a target:

Google Alerts (google.com/alerts): email notifications for new results matching a query. Set "example.com" breach or "Acme Corporation" credentials to be alerted when fresh material appears.
Social media monitoring tools (TweetDeck, now X Pro; advanced Twitter/X search; Brand24): monitor mentions in near-real-time
SpiderFoot HX and Recorded Future: commercial platforms for continuous automated OSINT with alerting

Quick recall

Google dork essentials: site:, filetype:, intitle:, inurl:, intext:, - to exclude. GHDB at exploit-db.com has thousands of proven patterns.
Bing indexes different pages; run key dorks on both. DuckDuckGo doesn't personalise, useful for unbiased results.
Wayback Machine: deleted content, old JS with exposed keys. waybackurls extracts every historically indexed URL for a domain.
GitHub: search for "api_key" "domain", "BEGIN RSA PRIVATE KEY". Git history preserves removed secrets; use TruffleHog or Gitleaks.
Document metadata (ExifTool, Metagoofil): author names, internal paths, GPS coordinates. EXIF GPS in images is geolocation for free.
Paste monitoring: site:pastebin.com "domain.com" plus Google Alerts. IntelX searches paste archives, dark web and breach data together.