Search engine dorking and web archives
Google and Bing dork operators, the Wayback Machine, paste sites, GitHub secret scanning and document metadata extraction.
Search engine operators (dorking)
Web search engines support advanced operators that narrow results to specific sites, file types, page sections and content patterns. Using these operators deliberately to find sensitive or exposed information is called Google dorking or search engine dorking. The technique requires no tools beyond a browser and no interaction with any target system, but it can surface exposed admin panels, configuration files, credential leaks, directory listings and version-disclosing error messages.
Google operator reference
| Operator | Example | What it finds |
|---|---|---|
| site: | site:example.com | All indexed pages on a domain |
| site: with path | site:example.com/admin | Pages under a specific path |
| filetype: | filetype:pdf site:example.com | Files with a specific extension |
| intitle: | intitle:"index of" | Pages with the text in their title tag |
| inurl: | inurl:admin | URLs containing a string |
| intext: | intext:"password" | Pages with the text in the body |
| cache: | cache:example.com | Google's cached copy of a page (Google retired cached pages in 2024; use the Wayback Machine instead) |
| - (minus) | site:example.com -www | Exclude results matching a term |
| "quotes" | "database error" | Exact phrase match |
| OR | site:example.com OR site:example.net | Results matching either term |
| * | "admin * password" | Wildcard for any word |
| before: / after: | after:2024-01-01 site:example.com | Date-bounded results |
High-value dork patterns
These combinations regularly surface sensitive information in CTFs and real-world assessments:
site:example.com filetype:pdf # all PDFs (check for metadata)
site:example.com filetype:xml OR filetype:conf OR filetype:log
intitle:"index of" site:example.com # open directory listings
inurl:phpinfo.php site:example.com # PHP info pages (leaks server config)
site:example.com inurl:".git" # exposed .git directories
site:example.com intext:"sql syntax near" # SQL error messages
"BEGIN RSA PRIVATE KEY" site:github.com # private keys in public repos
site:pastebin.com "example.com" # pastes referencing the target
inurl:wp-admin site:example.com # WordPress admin interfaces
site:example.com intext:"Index of /" intitle:"Index of"
"X-Forwarded-For" filetype:log # log files leaking headers
The Google Hacking Database (GHDB) at exploit-db.com/google-hacking-database catalogues thousands of working dorks categorised by type: files containing passwords, sensitive directories, error messages, exposed login portals. It is a reference worth bookmarking before any assessment.
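The patterns above differ only in the target domain, so they can be generated per target rather than retyped. The following is a minimal sketch, not a standard tool: `DORK_TEMPLATES`, `build_dorks` and `google_url` are hypothetical names, and the template list is just a handful of the dorks shown earlier.

```python
from urllib.parse import quote_plus

# Hypothetical template list: a few of the patterns above with the
# target domain left as a placeholder.
DORK_TEMPLATES = [
    'site:{domain} filetype:pdf',
    'intitle:"index of" site:{domain}',
    'inurl:phpinfo.php site:{domain}',
    'site:{domain} inurl:".git"',
    'site:pastebin.com "{domain}"',
]


def build_dorks(domain: str) -> list[str]:
    """Fill each template with the target domain."""
    return [template.format(domain=domain) for template in DORK_TEMPLATES]


def google_url(query: str) -> str:
    """Return a Google search URL with the query URL-encoded."""
    return "https://www.google.com/search?q=" + quote_plus(query)


if __name__ == "__main__":
    # Print one ready-to-open URL per dork; nothing is fetched here.
    for dork in build_dorks("example.com"):
        print(google_url(dork))
```

The script only prints URLs to open in a browser. Sending these queries to Google automatically tends to trigger CAPTCHAs quickly, so running them by hand is usually the more reliable route.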
Bing and DuckDuckGo
Bing supports site:, filetype:, inurl: and intitle: with similar syntax and indexes different content from Google. Running the same dork pattern on Bing after Google sometimes surfaces results that Google's algorithm has de-prioritised, particularly older pages.
DuckDuckGo supports site: and filetype: and does not personalise results, which is useful when you want a view unclouded by your search history.
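Because each engine accepts a different subset of operators, it can help to check a dork against that subset before running it everywhere. This sketch assumes the operator support described above (the `SUPPORTED` table simply restates it) and uses hypothetical helper names; it builds search URLs only for the engines that understand every operator in the query.

```python
from urllib.parse import urlencode

# Search endpoints; each engine takes the query in the "q" parameter.
ENGINES = {
    "google": "https://www.google.com/search",
    "bing": "https://www.bing.com/search",
    "duckduckgo": "https://duckduckgo.com/",
}

# Operator support as described in the text above (an assumption, not
# an exhaustive list of what each engine accepts).
SUPPORTED = {
    "google": {"site", "filetype", "inurl", "intitle", "intext"},
    "bing": {"site", "filetype", "inurl", "intitle"},
    "duckduckgo": {"site", "filetype"},
}


def operators_in(query: str) -> set[str]:
    """Return the operator names (the part before ':') used in a query."""
    found = set()
    for token in query.split():
        name = token.lstrip("-").split(":", 1)[0]
        if ":" in token and name.isalpha():
            found.add(name.lower())
    return found


def engine_urls(query: str) -> dict[str, str]:
    """Build a search URL for every engine that supports all operators."""
    needed = operators_in(query)
    return {
        engine: base + "?" + urlencode({"q": query})
        for engine, base in ENGINES.items()
        if needed <= SUPPORTED[engine]
    }


if __name__ == "__main__":
    for engine, url in engine_urls("site:example.com inurl:admin").items():
        print(engine, url)
```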
The Wayback Machine
The Wayback Machine (web.archive.org) archives web snapshots going back to 1996. For OSINT, it provides access to content that no longer exists on the live site:
- Deleted configuration files, employee directories and product pages
- Previous versions of a page before a cleanup
- Sites that have been taken down entirely
- Old JavaScript files containing hardcoded API keys or endpoints that were removed from the live version
Enter a URL to see the calendar of archived snapshots. To browse all URLs ever captured for a domain, use the wildcard:
https://web.archive.org/web/*/example.com/*
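Waybackurls is a CLI tool that extracts the historically indexed URLs for a domain from the Wayback Machine; current versions also query Common Crawl and, when an API key is configured, VirusTotal. (The related tool gau adds URLScan as a source.)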
waybackurls example.com | sort -u
This surfaces endpoints and paths that don't appear in the current sitemap: development routes, deprecated APIs, parameter names that reveal application structure.
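The same URL list can be pulled without extra tooling from the Wayback Machine's CDX API. The sketch below assumes the public endpoint at web.archive.org/cdx/search/cdx and its `url`, `output`, `fl` and `collapse` parameters; the function names and the `INTERESTING` extension list are illustrative choices, and the network call is left to the reader.

```python
import json
from urllib.parse import urlencode, urlparse

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

# Hypothetical shortlist of extensions worth reviewing first.
INTERESTING = (".js", ".json", ".env", ".bak", ".sql", ".conf", ".log")


def cdx_query_url(domain: str) -> str:
    """Build a CDX query for every captured URL under a domain."""
    params = {
        "url": f"{domain}/*",   # wildcard: everything under the domain
        "output": "json",       # JSON array of rows, header row first
        "fl": "original",       # return only the original URL field
        "collapse": "urlkey",   # one row per distinct URL
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def parse_cdx_json(body: str) -> list[str]:
    """Return the sorted, de-duplicated URLs from a CDX JSON response."""
    rows = json.loads(body) if body.strip() else []
    return sorted({row[0] for row in rows[1:]})  # rows[0] is the header


def interesting(urls: list[str]) -> list[str]:
    """Keep URLs whose path ends in one of the INTERESTING extensions."""
    return [u for u in urls if urlparse(u).path.lower().endswith(INTERESTING)]


if __name__ == "__main__":
    # Fetch this URL with a browser or HTTP client, then pass the
    # response body to parse_cdx_json().
    print(cdx_query_url("example.com"))
```

Filtering by extension is a quick first pass; parameter names in query strings are often just as revealing and deserve a separate look.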
URLScan.io (urlscan.io) stores the results of URL scans submitted by security tools. Searching urlscan for a domain reveals historical scans, linked resources, JavaScript endpoints and external requests, giving another timeline of how the site has looked.
Paste sites
When credentials, source code, configuration data or API keys leak, they frequently appear on paste sites, ephemeral text-hosting services where anyone can share text anonymously.
Key sources to check and monitor:
- Pastebin (pastebin.com): the largest and most searched. Monitor with:
site:pastebin.com "targetcompany.com"
- GitHub Gist (gist.github.com): public gists are findable. Developers often paste credentials into "secret" gists while troubleshooting, not realising that "secret" only means unlisted: anyone who has the URL can read the gist.
- ControlC (controlc.com) and Hastebin: alternative services with different indexing
- Intelligence X (intelx.io): searches Pastebin, dark web pastes, leaked breach databases and many other sources simultaneously; the most efficient single search for paste exposure
Automating paste monitoring: set Google Alerts for site:pastebin.com "[target]", or use PasteHunter for self-hosted automated collection.
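Once a paste mentioning the target has been saved locally, the lines that matter are usually credential pairs for the target's own mail domain. As a rough sketch, assuming the common `email:password` dump layout (real pastes vary widely), the hypothetical `credential_lines` helper below pulls out those pairs, including addresses on subdomains.

```python
import re


def credential_lines(paste_text: str, domain: str) -> list[tuple[str, str]]:
    """Return (email, secret) pairs for addresses at the target domain.

    Matches "user@domain:secret" with ':', ';' or '|' as the separator.
    """
    pattern = re.compile(
        r"([A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\.)*" + re.escape(domain) + r")"
        r"\s*[:;|]\s*(\S+)",
        re.IGNORECASE,
    )
    return pattern.findall(paste_text)


if __name__ == "__main__":
    sample = (
        "alice@example.com:hunter2\n"
        "bob@other.org:pw\n"
        "carol@mail.example.com | s3cret\n"
    )
    for email, secret in credential_lines(sample, "example.com"):
        print(email, secret)
```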
GitHub OSINT
GitHub and similar code hosts are a major source of accidentally exposed secrets. Developers commit .env files, config files with API keys, hardcoded passwords and private keys, then may push a "removal" commit not realising the full git history remains publicly accessible.
GitHub code search operators:
"example.com" password # credentials mentioning the domain
"BEGIN RSA PRIVATE KEY" # RSA private keys
"api_key" "example.com" # API keys with domain context
user:johndoe filename:.env # .env files in a specific user's repos
org:acmecorp filename:config.yml # config files in an organisation
extension:sql password # SQL dumps containing passwords
"DB_PASSWORD" "example.com" # database credentials
"aws_access_key_id" # AWS credentials
"Authorization: Bearer" # hardcoded tokens
A commit that removes a file does not remove it from the git history: the secret stays readable in every earlier commit. In a local clone, list the commits on all branches and inspect any one of them with:
git log --oneline --all
git show <commit-hash>
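Reading commits one by one does not scale, so the full patch history can be searched for known credential formats instead. The following is a minimal sketch that assumes `git` is on the PATH and the repository is already cloned; the three patterns in `SECRET_PATTERNS` are illustrative and far from a complete rule set.

```python
import re
import subprocess

# Illustrative patterns only; dedicated scanners ship far larger rule sets.
SECRET_PATTERNS = {
    "private_key": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "bearer_token": re.compile(r"Authorization:\s*Bearer\s+[A-Za-z0-9._~+/=-]{20,}"),
}


def history_patches(repo_path: str) -> str:
    """Return the diff of every commit on every ref of a local clone."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "-p", "--all"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def find_secrets(patch_text: str) -> list[tuple[str, str]]:
    """Return (pattern name, line) for added or removed lines that match."""
    hits = []
    for line in patch_text.splitlines():
        # Skip context lines and the "--- a/file" / "+++ b/file" headers.
        if not line.startswith(("+", "-")) or line.startswith(("+++ ", "--- ")):
            continue
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((name, line[1:].strip()))
    return hits

# Usage, given a local clone in ./repo:
#     for name, line in find_secrets(history_patches("./repo")):
#         print(name, line)
```

Removed lines are checked as well as added ones, because a "removal" commit shows the secret as a deleted line in its diff.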
Automated secret-scanning tools:
- TruffleHog: scans git history for high-entropy strings and known credential patterns:
trufflehog git https://github.com/org/repo
- GitLeaks: similar, with a comprehensive rule set for common service credential formats
- GitHound: searches GitHub at scale for a target's credentials across all public repositories, not only their own
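The "high-entropy strings" idea is easy to illustrate: random keys use many different characters roughly equally often, so their Shannon entropy per character is higher than that of ordinary words. This is a sketch of the general technique, not of how any of the tools above implement it; the token regex and the 4.0-bit threshold are assumptions chosen for the example.

```python
import math
import re
from collections import Counter

# Candidate tokens: long runs of characters typical of keys and tokens.
TOKEN = re.compile(r"[A-Za-z0-9+/=_-]{20,}")


def shannon_entropy(s: str) -> float:
    """Return the Shannon entropy of a string in bits per character."""
    if not s:
        return 0.0
    total = len(s)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(s).values()
    )


def high_entropy_tokens(text: str, threshold: float = 4.0) -> list[str]:
    """Return long tokens whose entropy meets the (assumed) threshold."""
    return [t for t in TOKEN.findall(text) if shannon_entropy(t) >= threshold]


if __name__ == "__main__":
    sample = 'pad = "aaaaaaaaaaaaaaaaaaaaaaaa" token = "9f8Kd2LmQx7Zp4Rt1Vb6Nc3Hy5Wj0Sg"'
    print(high_entropy_tokens(sample))
```

Entropy alone produces false positives (hashes, minified code, base64 images), which is why scanners pair it with format-specific patterns.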
Document metadata
Documents published online carry invisible metadata: the author's full name, organisation, software version used to create it, internal file path (which may reveal internal server names), and editing history. PDF, Word, Excel and PowerPoint files are the most common carriers.
ExifTool extracts metadata from most common document, image and media formats:
exiftool document.pdf # PDF metadata: author, creator application, dates
exiftool image.jpg # photo metadata including GPS coordinates
exiftool -r ./documents/ # recursive extraction from an entire directory
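For a whole directory, ExifTool's JSON output (`exiftool -json -r ./documents/`) is easier to aggregate than the default text. The sketch below assumes that output has been saved or captured as a string; the tag names in `PEOPLE_TAGS` and `SOFTWARE_TAGS` are a guess at the most useful fields and vary by file format.

```python
import json
from collections import Counter

# Assumed tag names; which ones appear depends on the file format.
PEOPLE_TAGS = ("Author", "LastModifiedBy")
SOFTWARE_TAGS = ("Creator", "Producer", "Software")


def summarise(exiftool_json: str) -> dict[str, Counter]:
    """Count the people and software named across all scanned files."""
    summary = {"people": Counter(), "software": Counter()}
    for record in json.loads(exiftool_json):  # one dict per file
        for tag in PEOPLE_TAGS:
            if record.get(tag):
                summary["people"][str(record[tag])] += 1
        for tag in SOFTWARE_TAGS:
            if record.get(tag):
                summary["software"][str(record[tag])] += 1
    return summary


if __name__ == "__main__":
    sample = json.dumps([
        {"SourceFile": "a.pdf", "Author": "Jane Doe", "Producer": "Acrobat Distiller 9.0"},
        {"SourceFile": "b.docx", "LastModifiedBy": "Jane Doe"},
    ])
    result = summarise(sample)
    print(result["people"].most_common())
    print(result["software"].most_common())
```

Names that recur across many documents are likely staff, and software versions hint at what the organisation runs internally.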
For harvesting documents from a target website automatically:
- Metagoofil: uses Google to find and download documents from a target domain, then extracts metadata from all of them:
metagoofil -d example.com -t pdf,doc,xls -l 50 -o ./results
- FOCA (Windows GUI): similar, with a graphical interface that visualises metadata relationships and inferred internal infrastructure
GPS coordinates in photos are common in CTF challenges. Smartphone cameras embed coordinates in EXIF whenever location tagging is enabled for the camera app, and the data stays in the file unless something strips it before publication. ExifTool reveals it in seconds.
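ExifTool prints coordinates by default in degrees, minutes and seconds, for example `51 deg 30' 26.00" N`, while most mapping sites expect signed decimal degrees. The conversion is a short calculation, sketched below with hypothetical helper names; the OpenStreetMap marker URL format is an assumption that can be swapped for any other map service.

```python
import re

# Matches ExifTool's default print format, e.g. 51 deg 30' 26.00" N
DMS = re.compile(r"""(\d+)\s*deg\s*(\d+)'\s*([\d.]+)"\s*([NSEW])""")


def dms_to_decimal(value: str) -> float:
    """Convert a degrees/minutes/seconds string to signed decimal degrees."""
    match = DMS.search(value)
    if match is None:
        raise ValueError(f"unrecognised coordinate: {value!r}")
    degrees, minutes, seconds, ref = match.groups()
    decimal = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
    # South and West are negative in decimal notation.
    return -decimal if ref in "SW" else decimal


def map_url(latitude: float, longitude: float) -> str:
    """Return an OpenStreetMap URL with a marker at the coordinates."""
    return (
        "https://www.openstreetmap.org/"
        f"?mlat={latitude:.6f}&mlon={longitude:.6f}"
    )


if __name__ == "__main__":
    lat = dms_to_decimal("51 deg 30' 26.00\" N")
    lon = dms_to_decimal("0 deg 7' 39.00\" W")
    print(map_url(lat, lon))
```

Running ExifTool with `-n` prints numeric values directly, which avoids the parsing step when the tool is under your control.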
Monitoring and alerting
For ongoing intelligence against a target:
- Google Alerts (google.com/alerts): email notifications for new results matching a query. Set "example.com" breach or "Acme Corporation" credentials to be alerted when fresh material appears.
- Social media monitoring tools (Tweetdeck/advanced Twitter search, Brand24): monitor mentions in near-real-time
- SpiderFoot HX and Recorded Future: commercial platforms for continuous automated OSINT with alerting
Quick recall
- Google dork essentials: site:, filetype:, intitle:, inurl:, intext:, - to exclude. GHDB at exploit-db.com has thousands of proven patterns.
- Bing indexes different pages; run key dorks on both. DuckDuckGo doesn't personalise, useful for unbiased results.
- Wayback Machine: deleted content, old JS with exposed keys. waybackurls extracts every historically indexed URL for a domain.
- GitHub: search for "api_key" "domain", "BEGIN RSA PRIVATE KEY". Git history preserves removed secrets; use TruffleHog or GitLeaks.
- Document metadata (ExifTool, Metagoofil): author names, internal paths, GPS coordinates. EXIF GPS in images is geolocation for free.
- Paste monitoring: site:pastebin.com "domain.com" plus Google Alerts. IntelX searches paste archives, dark web and breach data together.