Cloudflare Accuses Perplexity AI of Masked Scraping Campaign Violating Robots.txt Directives
Cloudflare has exposed what it describes as systematic efforts by Perplexity AI to circumvent website owners' content protection measures. According to the CDN provider's investigation, Perplexity deployed masked crawlers that impersonated human users to scrape content from domains that explicitly blocked its bots through both robots.txt directives and custom firewall rules.
Evasion Tactics in Action
When websites blocked Perplexity's declared crawlers (PerplexityBot and Perplexity-User) via robots.txt and Web Application Firewall (WAF) rules, the AI startup allegedly switched to undeclared IP addresses outside its published ranges, rotating them across multiple Autonomous Systems. Combined with a user-agent string mimicking ordinary Chrome traffic from macOS devices, this rotation let Perplexity slip past technical barriers by disguising automated scraping as human visits.
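The gap being exploited here is easy to demonstrate: robots.txt rules match on the user-agent string a client *declares*, so a crawler that presents itself as a browser sails past a block aimed at its real identity. A minimal sketch using Python's standard-library robots.txt parser (the robots.txt content below is hypothetical, modeled on the directives described in the article):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt disallowing Perplexity's declared crawlers
# while leaving the site open to everyone else.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A crawler that honestly declares itself is refused.
print(rp.can_fetch("PerplexityBot", "https://example.com/article"))  # False

# The same crawler presenting a generic Chrome/macOS user agent falls
# through to the wildcard rule -- exactly the impersonation gap Cloudflare
# says was exploited.
browser_ua = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"
print(rp.can_fetch(browser_ua, "https://example.com/article"))  # True
```

This is why robots.txt is purely an honor system: enforcement against a non-compliant scraper has to happen at the network layer, which is where the WAF rules and Cloudflare's countermeasures come in.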
Cloudflare's tests revealed this activity spanned "tens of thousands of domains and millions of requests per day." When queried about scraped content, Perplexity's AI generated detailed responses confirming successful data extraction from protected sources.
Defensive Countermeasures Deployed
In response, Cloudflare has implemented three key protections:
1. Bot Fingerprinting: Enhanced detection of Perplexity's disguised user agents
2. Challenge Protocols: CAPTCHA-style verification allowing human access while blocking bots
3. Managed Ruleset: New signatures added to Cloudflare's free AI crawler blocking tool
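The logic behind protections like these can be sketched as a simple classification rule: a request is treated as a verified bot only when its declared crawler identity matches a published IP range; a crawler user agent from an unknown range is blocked outright, and browser-like traffic that trips bot heuristics gets a CAPTCHA-style challenge. The range, function name, and labels below are all hypothetical, for illustration only:

```python
import ipaddress

# Hypothetical published IP range for a declared crawler (placeholder
# documentation prefix, not Perplexity's real ranges).
DECLARED_RANGES = [ipaddress.ip_network("192.0.2.0/24")]


def classify(ip: str, user_agent: str) -> str:
    """Toy WAF-style rule mirroring the three protections above."""
    addr = ipaddress.ip_address(ip)
    from_declared = any(addr in net for net in DECLARED_RANGES)
    declares_bot = "PerplexityBot" in user_agent or "Perplexity-User" in user_agent

    if declares_bot and from_declared:
        return "verified-bot"   # honest crawler from its published range
    if declares_bot:
        return "block"          # crawler identity from an undeclared IP
    return "challenge"          # browser-like traffic: humans pass, bots fail


print(classify("192.0.2.10", "PerplexityBot/1.0"))   # verified-bot
print(classify("203.0.113.5", "PerplexityBot/1.0"))  # block
print(classify("203.0.113.5", "Mozilla/5.0 (Macintosh) Chrome/120"))  # challenge
```

Real bot management layers far richer signals on top of this (TLS fingerprints, behavioral scoring, request volume), but the declared-identity-versus-origin check is the core idea.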
"Customers who already block AI crawlers are automatically protected against this activity," Cloudflare stated, while emphasizing that OpenAI—though facing copyright lawsuits—respects robots.txt restrictions.
The Scraping Ethics Divide
This incident highlights growing tensions between AI firms hungry for training data and publishers' rights. Cloudflare recently launched its "Pay Per Crawl" program, which lets publishers monetize AI scraping, mirroring the licensing deals struck between AI companies and media giants such as Gannett. Yet Perplexity's alleged tactics suggest some players prioritize data acquisition over ethical compliance.
As publishers increasingly weaponize WAF configurations against unauthorized scraping, the industry faces fundamental questions about AI's relationship with content provenance. When language models absorb knowledge without consent, they risk poisoning their own training wells through adversarial data collection practices.
Source: Steven Vaughan-Nichols, ZDNET