Saturday, November 29, 2025

Commentary on Cloudflare's Perplexity AI Stealth Crawling Investigation

Cloudflare's August 2025 blog post represents one of the most technically rigorous public exposés of AI crawler misbehavior to date. The findings are significant not just for what they reveal about Perplexity specifically, but for what they demonstrate about the deteriorating trust infrastructure that has governed web crawling for three decades.

The Technical Evidence Is Damning

Cloudflare's methodology was meticulous. They created freshly purchased test domains that had never been indexed by any search engine and were not publicly discoverable in any way. These domains had explicit robots.txt directives blocking all automated access and WAF rules specifically targeting Perplexity's declared crawlers (PerplexityBot and Perplexity-User). Despite these comprehensive blocks, Perplexity was able to provide detailed information about the exact content hosted on these restricted domains. [cloudflare]
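To make the robots.txt half of that setup concrete, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt body, the example.test domain, and the is_allowed helper are illustrative assumptions, not Cloudflare's actual test configuration.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt for a test domain: every agent is disallowed everywhere.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.test is a placeholder domain, not one of Cloudflare's test domains.
    for agent in ("PerplexityBot", "Perplexity-User"):
        print(agent, is_allowed(ROBOTS_TXT, agent, "https://example.test/page"))
```

Note that the parser only answers the question of whether a fetch is permitted. Nothing in the file itself stops a client that never asks, which is why the test domains were paired with WAF rules.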

The forensic analysis revealed a dual-crawler system:

Crawler Type | User Agent | Daily Requests
Declared | Mozilla/5.0 AppleWebKit/537.36 (compatible; Perplexity-User/1.0) | 20-25 million
Stealth | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Chrome/124.0.0.0 Safari/537.36 | 3-6 million

The stealth crawler impersonated a standard Chrome browser on macOS, used IP addresses outside Perplexity's official ranges, rotated through multiple Autonomous System Numbers (ASNs) to evade blocking, and accessed content "across tens of thousands of domains and millions of requests per day". This is not accidental behavior; it is systematic evasion. [techcrunch +1]
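A short sketch shows why a rule keyed on the declared crawler names cannot catch this kind of impersonation. The two user-agent strings are the ones quoted in this post; the blocked_by_ua_rule function is a hypothetical stand-in for a user-agent block rule, not Cloudflare's actual WAF logic.

```python
# Tokens a publisher might block, based on Perplexity's declared crawler names.
DECLARED_TOKENS = ("perplexitybot", "perplexity-user")

# User-agent strings as quoted in this post.
DECLARED_UA = "Mozilla/5.0 AppleWebKit/537.36 (compatible; Perplexity-User/1.0)"
STEALTH_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "Chrome/124.0.0.0 Safari/537.36"
)


def blocked_by_ua_rule(user_agent: str) -> bool:
    """Hypothetical block rule: match declared crawler names, case-insensitively."""
    ua = user_agent.lower()
    return any(token in ua for token in DECLARED_TOKENS)


if __name__ == "__main__":
    print(blocked_by_ua_rule(DECLARED_UA))  # True: the declared crawler is caught
    print(blocked_by_ua_rule(STEALTH_UA))   # False: the generic Chrome string passes
```

Because the second string is indistinguishable from an ordinary browser at this level, detection has to rely on signals other than the user-agent header.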

The Defense Doesn't Hold Up

Perplexity's response claimed Cloudflare has "a fundamental misunderstanding of how AI assistants work," arguing their requests are made on behalf of users in real time, not as preemptive crawling, akin to what a browser or email client does. The company insists it shouldn't be "governed by rules designed for traditional web crawlers". [searchengineland +1]

This defense is problematic for several reasons:

First, Cloudflare's test domains were never publicly discoverable. No user could have legitimately asked Perplexity to fetch content from a domain that did not appear in any search index. The only way Perplexity could have accessed this content is through proactive crawling or by honoring user requests for URLs that users somehow obtained, which still constitutes unauthorized access to clearly prohibited resources.

Second, the distinction between "user-initiated fetching" and "crawling" is largely semantic when the result is identical: a bot accesses content the publisher explicitly prohibited. RFC 9309, the IETF standard governing robots.txt, applies to "automatic clients known as crawlers". Whether the automation is triggered by a user request or a scheduled job doesn't change the fact that it's automated access to restricted content. [datatracker.ietf]

Third, this isn't Perplexity's first rodeo. In 2024, CEO Aravind Srinivas told Fast Company that Perplexity wasn't directly ignoring robots.txt but was "using third-party scrapers that ignored it", while declining to name the scraper or commit to stopping the practice. Forbes, Wired, and The New York Times have all raised similar complaints. At some point, a pattern of "misunderstandings" becomes willful conduct. [businessinsider +3]

This Violates Established Standards

The robots.txt protocol became an official IETF standard (RFC 9309) in September 2022, formalizing rules that had governed crawler behavior since 1994. The standard is explicit: crawlers are "requested to honor" the rules when accessing URIs. While technically advisory rather than mandatory, the RFC represents the consensus of the technical community about appropriate crawler behavior. [searchengineworld +2]

Cloudflare's post explicitly notes that Perplexity's conduct is "incompatible" with established preferences that crawlers should be transparent, serve a clear purpose, perform specific activities, and "most importantly, follow website directives and preferences". [cloudflare]

OpenAI Shows It Can Be Done Right

Cloudflare's comparison to OpenAI is instructive. When subjected to the same tests, ChatGPT-User fetched the robots file and stopped crawling when disallowed. When presented with a block page even without a robots.txt disallow directive, ChatGPT again stopped crawling with "no additional crawl attempts from other user agents". This demonstrates that respecting publisher preferences is technically feasible and that Perplexity's evasion represents a choice, not a necessity. [cloudflare]
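The compliant behavior described above can be sketched in a few lines. This is an illustrative assumption about how a well-behaved fetcher might be structured, not OpenAI's implementation: the agent name ExampleAssistant-User, the polite_fetch function, and the injected fetch callable are all hypothetical.

```python
from typing import Callable, Optional, Tuple
from urllib.robotparser import RobotFileParser

# Hypothetical declared agent name; it is sent on every request and never swapped.
USER_AGENT = "ExampleAssistant-User"


def polite_fetch(
    url: str,
    robots_txt: str,
    fetch: Callable[[str, str], Tuple[int, str]],
) -> Optional[str]:
    """Fetch url only if robots_txt allows it; treat a block page as final.

    fetch(url, user_agent) is an injected transport returning (status, body),
    so the sketch runs without network access.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(USER_AGENT, url):
        return None  # disallowed: no request is made at all

    status, body = fetch(url, USER_AGENT)
    if status in (401, 403):
        return None  # blocked: stop, with no retry under another user agent
    return body if status == 200 else None


if __name__ == "__main__":
    def fake_fetch(url: str, user_agent: str) -> Tuple[int, str]:
        # Stand-in transport that always serves a block page.
        return 403, "blocked"

    print(polite_fetch("https://example.test/page", "", fake_fetch))  # None
```

The design choice that matters is the absence of any fallback path: once the site says no, whether through robots.txt or a block page, the function returns nothing rather than trying a different identity.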

The Broader Implications

This incident illustrates a troubling trend in the AI industry. Perplexity is valued at approximately $3 billion and has received backing from Jeff Bezos and Nvidia. Reddit has now filed a federal lawsuit against Perplexity alleging "industrial-scale" data theft. The New York Times sent the company a cease-and-desist notice in October 2024. [reuters +3]

The fundamental business model appears to depend on accessing content that publishers have explicitly prohibited. When Cloudflare successfully blocked the stealth crawler in tests, Perplexity's answers became "less specific and lacked details from the original content", suggesting that the unauthorized scraping is essential to the product's value proposition, not incidental to it. [cloudflare]

Regulatory and Technical Responses

Cloudflare has taken several concrete actions:

  • Delisted Perplexity as a verified bot, categorizing it alongside unreliable actors [pcmag]

  • Added heuristics to managed rules that block stealth crawling behavior [cloudflare]

  • Made protections available to all customers, including free tier users [cloudflare]

Over 2.5 million websites now use Cloudflare's managed tools to block AI crawlers. The company is also working with the IETF to standardize extensions to robots.txt that could carry AI-specific usage preferences. [cyberpress +2]

The Honor System Is Breaking

The web's crawling ecosystem has operated on trust for three decades. Search engines agreed to respect robots.txt in exchange for the privilege of indexing content. This worked because there was mutual benefit—publishers wanted search traffic, and search engines needed content.

AI answer engines break this compact. They extract the value of content (information) while potentially eliminating the need for users to visit the source (and see ads, subscribe, or otherwise support the creator). When the incentives diverge this sharply, and when billions of dollars are at stake, the honor system appears insufficient.

Cloudflare's investigation should serve as a wake-up call: the voluntary compliance model that has governed web crawling since 1994 may need enforceable legal and technical standards to survive the AI era. Publishers who want to protect their content need more than a politely worded text file. They need infrastructure partners willing to actively detect and block bad actors, regardless of how well-funded or well-connected those actors may be.

  1. https://blog.cloudflare.com/perplexity-is-using-stealth-undeclared-crawlers-to-evade-website-no-crawl-directives/
  2. https://techcrunch.com/2025/08/04/perplexity-accused-of-scraping-websites-that-explicitly-blocked-ai-scraping/
  3. https://searchengineland.com/cloudflare-vs-perplexity-ai-crawling-460016
  4. https://www.actuia.com/en/news/cloudflare-accuses-perplexity-of-using-stealth-crawlers-to-bypass-content-access-rules/
  5. https://datatracker.ietf.org/doc/html/rfc9309
  6. https://www.businessinsider.com/perplexity-ai-forbes-wired-explained-2024-6
  7. https://techcrunch.com/2024/07/02/news-outlets-are-accusing-perplexity-of-plagiarism-and-unethical-web-scraping/
  8. https://www.theverge.com/2024/6/27/24187405/perplexity-ai-twitter-lie-plagiarism
  9. https://www.reuters.com/technology/artificial-intelligence/nyt-sends-ai-startup-perplexity-cease-desist-notice-over-content-use-wsj-reports-2024-10-15/
  10. https://www.searchengineworld.com/rfc9309-robots-txt-quietly-became-an-official-internet-standard
  11. https://www.immwit.com/wiki/robots/
  12. https://www.forbes.com/sites/anishasircar/2025/10/23/would-be-bank-robbers-reddit-escalates-ai-data-wars-with-perplexity-lawsuit/
  13. https://www.reuters.com/world/reddit-sues-perplexity-scraping-data-train-ai-system-2025-10-22/
  14. https://www.pcmag.com/news/cloudflare-perplexity-ai-acts-like-north-korean-hackers-ignores-scraping
  15. https://cyberpress.org/cloudflare-claims-perplexity-ai/
  16. https://ca.finance.yahoo.com/news/exclusive-multiple-ai-companies-bypassing-143742268.html
  17. https://www.rfc-editor.org/rfc/rfc9309.html
  18. https://www.theregister.com/2025/08/04/perplexity_ai_crawlers_accused_data_raids/
  19. https://en.wikipedia.org/wiki/Robots.txt
  20. https://arstechnica.com/information-technology/2025/08/ai-site-perplexity-uses-stealth-tactics-to-flout-no-crawl-edicts-cloudflare-says/
  21. https://developer.mozilla.org/en-US/docs/Web/Security/Practical_implementation_guides/Robots_txt
  22. https://www.wired.com/story/perplexity-is-a-bullshit-machine/
