LLM-Dorking: Weaponizing Google-Style Search Operators for AI-Powered OSINT
For decades, security researchers and OSINT professionals have relied on Google dorking – crafting precise search queries using operators like intext:, site:, and filetype: – to uncover exposed credentials, sensitive documents, and vulnerabilities. Now, a paradigm shift is emerging: LLM-Dorking adapts this powerful methodology for the age of Large Language Models (LLMs) like ChatGPT and Claude, merging structured search syntax with AI reasoning for unprecedented reconnaissance capabilities.
Beyond Keyword Queries: The LLM Dorking Engine
Traditional dorking relies on manual query construction and result parsing. LLM-Dorking supercharges this by:
1. Translating Intent: Converting human objectives into optimized search engine API calls (SerpAPI, Brave, Bing).
2. Intelligent Filtering: Using the LLM to post-process, summarize, score, and filter raw search results based on complex criteria.
3. Contextual Understanding: Leveraging the model's ability to interpret nuance, synonyms, and temporal context.
```python
# Example core logic from run_query.py (simplified):
# 1. Load a structured prompt (e.g., grants.txt)
# 2. Execute a SerpAPI search using prompt-derived parameters
# 3. Feed the results plus the original prompt to the LLM for filtering/summarizing
import os
import requests
import openai  # legacy (openai<1.0) SDK style, as in the original snippet

API_KEY = os.environ["SERPAPI_API_KEY"]  # your SerpAPI key

params = {
    "q": 'site:gva.es filetype:pdf "subvención autónomos" after:2023-01-01',
    "api_key": API_KEY,
    "engine": "google",
}
search_results = requests.get("https://serpapi.com/search", params=params).json()

llm_response = openai.ChatCompletion.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Summarize and filter results."},
        {"role": "user", "content": str(search_results.get("organic_results", []))},
    ],
)
```
The LLM Dork Cheat Sheet: Old Operators, New Power
The framework provides direct translations for essential operators:
| Google Dork | LLM Prompt Snippet | Purpose |
|---|---|---|
| `intext:` | "Find text that contains password" | Locate specific strings within page content. |
| `site:` | "Search only site:github.com" | Restrict searches to specific domains. |
| `filetype:` | "Restrict to PDF files mentioning confidential" | Target specific document formats. |
| `intitle:` | "Only return pages whose title includes admin" | Filter by page titles. |
| `before:`/`after:` | "Only include documents published after 2024-01-01" | Apply temporal filters. |
| `~synonym` | "Include synonyms for configure (set, change)" | Broaden searches semantically. |
These operators can be combined with boolean logic, e.g. `(site:github.com | site:gitlab.com) & intext:"API_KEY"`, within the LLM prompt for surgical precision.
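As a rough sketch of how such compound dorks might be composed programmatically (the helper functions below are illustrative, not part of run_query.py):

```python
def any_of(*terms: str) -> str:
    """OR-group a set of operators, e.g. (site:github.com | site:gitlab.com)."""
    return "(" + " | ".join(terms) + ")"

def all_of(*terms: str) -> str:
    """AND-join operator groups with '&'."""
    return " & ".join(terms)

# Build the compound dork from the example above
dork = all_of(
    any_of("site:github.com", "site:gitlab.com"),
    'intext:"API_KEY"',
)
print(dork)  # (site:github.com | site:gitlab.com) & intext:"API_KEY"
```

Keeping the operator grammar in small helpers like this makes it easy for an LLM (or a human) to assemble many query variants from one target list.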
Real-World Applications: Beyond Theory
The /prompts directory showcases practical use cases ready for deployment:
- `security.txt`: Hunt exposed API keys & configs across code repositories and paste sites.
- `grants.txt`: Discover EU and Spanish startup subsidies from official sources (`site:boe.es`, `site:europa.eu`), filtering for recent PDF/DOC files.
- `mna_targets.txt`: Identify distressed companies for M&A by scanning for phrases like "concurso de acreedores" (bankruptcy) or "busca comprador" (seeking buyer) in local news and filings.
- `wildcard_dirs.txt`: Find exposed open directories ("index of /") containing specific media types (mp3, mp4, pdf).
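The repository's actual prompt format isn't reproduced here, but a grants-style prompt could plausibly look something like the following (purely illustrative; the field names are assumptions, not the project's real schema):

```
# Illustrative sketch of a grants-style prompt, not the repository's actual file
objective: Find recently published EU/Spanish startup subsidy announcements.
dork: (site:boe.es | site:europa.eu) filetype:pdf "subvención" after:2024-01-01
filter: Keep only results that name an application deadline and a funding amount.
output: One-line summary per grant: title, deadline, amount, source URL.
```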
Why this changes OSINT: "LLM-Dorking moves beyond simple keyword scraping," explains the project's inspiration. "It leverages the LLM's ability to understand the context of what it finds – assessing risk levels of a leaked credential snippet or extracting key details from a grant document – automating hours of manual analysis into seconds."
Building an Intelligence Pipeline
The minimal Python runner (`run_query.py`) provides a foundation, but the true power lies in integration:
- Automation: Schedule recurring dorks via tools like n8n, emailing daily digests.
- Agent Orchestration: Use LangChain to chain prompts, loop through keyword lists, and store enriched results in vector databases (Qdrant, Pinecone).
- Real-Time Alerting: Configure Slack alerts for new leaks using rolling `before:`/`after:` windows.
- Specialized Bots: Build micro-bots for niche tasks (e.g., scraping local news for specific liquidation phrases and estimating company distress signals).
Responsible Exploration Is Mandatory
The power of LLM-Dorking necessitates ethical considerations. The MIT license includes a clear directive: Hack responsibly. This tool excels at finding publicly accessible information, but its effectiveness underscores the critical need for organizations to:
- Rigorously scan their own public footprints for accidental leaks.
- Implement robust monitoring for sensitive data exposure.
- Understand that AI lowers the barrier for sophisticated reconnaissance.
As LLMs become integral to the analyst's toolkit, frameworks like LLM-Dorking represent more than a technical novelty; they signal a fundamental shift in how we interrogate the vastness of the web – turning AI into the ultimate query language for the digital age. The era of intelligent, automated OSINT reconnaissance is here.