News Publishers Block Internet Archive to Prevent AI Training Data Scraping
#Security


Major news outlets including The Guardian and The New York Times are restricting Internet Archive access, fearing their archived content could be scraped by AI companies for training data.

News publishers are increasingly restricting the Internet Archive's access to their content, citing concerns that AI companies might use the nonprofit's vast digital library to scrape training data without permission.

Publishers Take Action Against Archive Crawlers

The Guardian has implemented measures to limit the Internet Archive's access to its published articles, according to Robert Hahn, the outlet's head of business affairs and licensing. The publisher has excluded itself from the Internet Archive's APIs and filtered its article pages from the Wayback Machine's URL interface.

"A lot of these AI businesses are looking for readily available, structured databases of content," Hahn explained. "The Internet Archive's API would have been an obvious place to plug their own machines into and suck out the IP."

While The Guardian's regional homepages, topic pages, and other landing pages will continue to appear in the Wayback Machine, the outlet has taken these steps to minimize the chance that AI companies might scrape its content through the nonprofit's repository of over one trillion webpage snapshots.
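
The kind of "readily available, structured" access Hahn describes is concrete: the Wayback Machine exposes a public Availability API that returns machine-readable JSON for any URL. Here is a minimal sketch of querying it with Python's standard library (the article URL is a hypothetical placeholder):

```python
import json
import urllib.parse
import urllib.request

# Ask the Wayback Machine's public Availability API for the most recent
# snapshot of a URL. The target article URL here is hypothetical.
target = "https://www.example.com/2024/some-article"
api = "https://archive.org/wayback/available?url=" + urllib.parse.quote(target, safe="")

with urllib.request.urlopen(api) as resp:
    data = json.load(resp)

# If a snapshot exists, it appears under archived_snapshots -> closest.
closest = data.get("archived_snapshots", {}).get("closest")
if closest:
    print(closest["timestamp"], closest["url"])
else:
    print("No snapshot recorded for this URL")
```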

The New York Times has gone further, confirming that it is actively "hard blocking" the Internet Archive's crawlers and that it added archive.org_bot to its robots.txt file in late 2025, disallowing access to its content. A Times spokesperson said the paper wants to ensure its intellectual property is "being accessed and used lawfully."
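
robots.txt is the plain-text file at a site's root through which publishers ask named crawlers to stay away. A disallow entry for the Archive's crawler looks roughly like this (an illustrative sketch, not the Times' actual file):

```
User-agent: archive.org_bot
Disallow: /
```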

The Archive's Mission vs. AI Concerns

Internet Archive founder Brewster Kahle responded to these restrictions by saying that "if publishers limit libraries, like the Internet Archive, then the public will have less access to the historical record." He suggested this could undermine the organization's work countering "information disorder."

However, the Archive has faced challenges with AI companies in the past. In May 2023, the Wayback Machine went offline temporarily after an AI company caused a server overload by sending tens of thousands of requests per second to extract text data from the nonprofit's public domain archives. The company eventually apologized and made a donation to the Archive.

Industry-Wide Trend Emerges

An analysis by Nieman Lab of 1,167 news websites found that 241 of them, across nine countries, explicitly disallow at least one of four Internet Archive crawling bots. The majority of these sites (87%) are owned by USA Today Co. (formerly Gannett), with each Gannett-owned outlet disallowing the same two bots: "archive.org_bot" and "ia_archiver-web.archive.org."
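
A check along the lines of Nieman Lab's can be sketched with Python's built-in urllib.robotparser, which fetches a site's robots.txt and reports whether a given user agent may crawl a path. The domain below is a hypothetical placeholder:

```python
import urllib.robotparser

# The two user agents that Gannett-owned outlets disallow, per Nieman Lab.
BOTS = ["archive.org_bot", "ia_archiver-web.archive.org"]

SITE = "https://www.example-newspaper.com"  # hypothetical outlet

rp = urllib.robotparser.RobotFileParser()
rp.set_url(SITE + "/robots.txt")
rp.read()  # fetches and parses the file

for bot in BOTS:
    status = "allowed" if rp.can_fetch(bot, SITE + "/") else "disallowed"
    print(f"{bot}: {status}")
```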

A USA Today Co. spokesperson said the company has "consistently emphasized the importance of safeguarding our content and intellectual property" and introduced new protocols in 2025 to deter unauthorized data collection and scraping.

CEO Mike Reed reported in an October 2025 earnings call that the company blocked 75 million AI bots across its platforms in September alone, with about 70 million coming from OpenAI.

The Broader Context

The Internet Archive's commitment to preserving the web has made it a target as news publishers try to safeguard their content from AI companies. The Financial Times blocks any bot that tries to scrape its paywalled content, including bots from OpenAI, Anthropic, Perplexity, and the Internet Archive.

Michael Nelson, a computer scientist and professor at Old Dominion University, noted that "Common Crawl and Internet Archive are widely considered to be the 'good guys' and are used by 'the bad guys' like OpenAI. In everyone's aversion to not be controlled by LLMs, I think the good guys are collateral damage."

Despite these concerns, the Internet Archive does not currently disallow any specific crawlers through its robots.txt file, including those of major AI companies. However, after inquiries from Nieman Lab, the Archive changed its robots.txt language from "Welcome to the Archive! Please crawl our files" to a simpler "Welcome to the Internet Archive!"

The situation highlights the tension between the Archive's mission to democratize information and publishers' need to protect their intellectual property in an era where AI companies are aggressively seeking training data.
