Over 340 U.S. local news sites, many owned by major chains, have disallowed the Internet Archive’s crawlers in their robots.txt files. Publishers cite concerns that AI firms could train models on archived articles without compensation. While the Archive is tightening abuse controls, the blocks threaten long‑term preservation of local journalism, prompting calls for better archiving solutions and licensing frameworks.
Over 340 local news outlets are limiting the Internet Archive’s access to their journalism
By Andrew Deck and Hanaa’ Tameez – May 20, 2026
What’s being claimed?
Nieman Journalism Lab reports that more than 340 local news websites across the United States have added the Internet Archive’s user‑agents to their robots.txt files, effectively preventing the Wayback Machine from crawling and preserving their articles. The move is framed as a pre‑emptive defense against AI companies that might scrape archived content for training large language models (LLMs) without paying the publishers.
What’s actually new?
Scale of the block – The latest crawl of robots.txt files (using the methodology described in the Lab’s January report) shows 382 news sites now disallow at least one Archive‑related bot, 342 of which are local outlets. This is a sizable jump from the 241 sites identified five months earlier.
Who is blocking?
- The majority belong to the five largest local‑news owners: USA Today Co. (formerly Gannett), McClatchy, Advance Local, MediaNews Group (Alden Global Capital), and Tribune Publishing (also owned by Alden).
- Specific examples include the Cleveland Plain Dealer, The Patriot‑News, The Oregonian, The Mercury News, The Denver Post, and The New York Daily News.
Bots in question – The analysis tracks several user‑agents that publishers associate with the Archive: Heritrix, my‑heritrix‑crawler, heritrix/3.3.0, Archive‑It, archive.org_bot, ia_archiver‑web.archive.org, and Special_archiver. The Wayback Machine itself has confirmed it does not use the ia_archiver* variants, but publishers block them anyway because they appear in public bot‑lists.
Publisher rationale – Statements from Advance Local, Alden’s MediaNews Group, and Condé Nast cite “protecting the value of published work” and “preventing unfair third‑party use.” Alden’s editorial from July 2025 explicitly links the block to ongoing litigation against OpenAI and Microsoft.
Technical response from the Archive – The Internet Archive has rolled out rate‑limiting for bulk downloads and partnered with Cloudflare to monitor suspicious bot traffic. Wayback Machine director Mark Graham says the organization is in “conversation with many publishers” and stresses that the Archive’s terms of use restrict data to scholarly or research purposes.
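To make the robots.txt mechanics in this section concrete, here is a minimal Python sketch of how one might check which of the listed user‑agents a given robots.txt file disallows. It is not the Lab’s actual methodology. The sample file, the `news.example` URL, and the `blocked_agents` helper are hypothetical, and the user‑agent names are transcribed from the list above with plain ASCII hyphens. Python’s standard‑library parser applies its own matching rules, which may differ from how a particular crawler reads the same file.

```python
from urllib.robotparser import RobotFileParser

# User-agent names taken from the article's list (ASCII hyphens assumed).
ARCHIVE_AGENTS = [
    "Heritrix",
    "my-heritrix-crawler",
    "heritrix/3.3.0",
    "Archive-It",
    "archive.org_bot",
    "ia_archiver-web.archive.org",
    "Special_archiver",
]

# Hypothetical robots.txt, invented for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: Archive-It
Disallow: /

User-agent: archive.org_bot
Disallow: /

User-agent: *
Allow: /
"""


def blocked_agents(robots_txt, agents=ARCHIVE_AGENTS,
                   url="https://news.example/story"):
    """Return the agents that this robots.txt text forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    for agent in blocked_agents(SAMPLE_ROBOTS_TXT):
        print("disallowed:", agent)
```

Run against the sample file, the helper reports only the two agents named in explicit Disallow groups; every other agent falls through to the catch‑all group and is allowed. Because robots.txt is advisory, a result like this shows what a publisher has requested, not what a crawler actually does.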
Why does it matter?
Preservation of local journalism
Local news is already under severe pressure: newsroom cuts, paywall closures, and the loss of physical archives have left many stories at risk of disappearing. Researchers, historians, and even working journalists rely on the Wayback Machine to retrieve articles that have been taken down or whose original sites have gone dark. As Edward McCain, a journalism librarian at the University of Missouri, warns, blocking the Archive weakens a vital link in the primary‑source chain.
Impact on AI training data
The fear driving the blocks is that AI developers could harvest the Archive’s collections, train LLMs, and then generate answers that draw on the original outlet’s reporting without attribution or compensation. While no publisher has yet proven that an AI firm has actually scraped their content from the Wayback Machine, publishers appear to judge that the potential legal exposure, especially in light of the ongoing OpenAI‑news‑publisher lawsuit, outweighs the speculative benefit of broader distribution.
Workarounds and their limits
Some outlets, like the Baltimore Banner, have adopted a nuanced approach: they block the Archive’s generic bots but allow crawlers used by specific AI services (e.g., ChatGPT, Claude) that can be forced to provide attribution links. This strategy hinges on the assumption that the AI provider will honor the request for citation, which is not guaranteed under current licensing norms.
Limitations and open questions
| Issue | Current state | Open question |
|---|---|---|
| Legal clarity | No definitive court ruling on whether Archive‑derived content can be used for LLM training without a license. | Will the pending copyright suit set a precedent that forces all archives to negotiate licensing? |
| Technical enforcement | Publishers rely on robots.txt, which is a voluntary standard; crawlers can ignore it. | How effective are the blocks in practice against sophisticated scraping tools? |
| Alternative archiving | Commercial services like ProQuest, LexisNexis, and emerging nonprofit initiatives (e.g., the Poynter‑Internet Archive training program) exist but are not free. | Can a sustainable, low‑cost archiving model be built for small local papers? |
| Attribution mechanisms | AI providers claim they will surface source URLs when possible, but enforcement is weak. | What technical standards could ensure reliable citation from model outputs? |
What can be done?
- Standardized licensing for archival use – A lightweight, machine‑readable license (e.g., a CC‑style tag) that distinguishes “research‑only” from “commercial‑AI‑training” could give publishers more control while keeping the content discoverable.
- Collaborative preservation grants – The Internet Archive’s partnership with the Poynter Institute and IRE to train 300 newsrooms by 2027 is a step forward, but funding must cover storage costs and metadata curation for smaller outlets.
- Technical safeguards – Deploying fingerprinting (e.g., content hashes) could allow publishers to track whether their articles appear in AI training corpora, providing evidence for future negotiations.
- Policy advocacy – Press freedom groups and library associations should lobby for fair‑use clarifications that protect archival preservation while allowing reasonable commercial use.
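The fingerprinting idea in the “Technical safeguards” item can be sketched roughly as follows. The article mentions only “content hashes”; hashing overlapping word windows (often called shingling) is one assumed way to make such hashes tolerant of reformatting, and the function names and the window size of 8 are illustrative choices rather than anything a publisher is known to use. The sketch also presumes the publisher can obtain text from a corpus to compare against, which in practice is often the hard part.

```python
import hashlib
import re


def shingle_hashes(text, window=8):
    """Return SHA-256 hashes of every run of `window` consecutive words.

    Text is lowercased and stripped of punctuation first, so cosmetic
    changes (case, commas, line breaks) do not alter the fingerprints.
    """
    words = re.findall(r"\w+", text.lower())
    if not words:
        return set()
    if len(words) < window:
        # Short text: fall back to a single hash over all of its words.
        return {hashlib.sha256(" ".join(words).encode("utf-8")).hexdigest()}
    return {
        hashlib.sha256(" ".join(words[i:i + window]).encode("utf-8")).hexdigest()
        for i in range(len(words) - window + 1)
    }


def overlap_ratio(article, candidate, window=8):
    """Fraction of the article's fingerprints that also occur in candidate."""
    article_hashes = shingle_hashes(article, window)
    if not article_hashes:
        return 0.0
    shared = article_hashes & shingle_hashes(candidate, window)
    return len(shared) / len(article_hashes)


if __name__ == "__main__":
    article = ("The city council voted on Tuesday to approve "
               "the new budget for road repairs")
    corpus_doc = ("Scraped page. THE CITY COUNCIL voted, on Tuesday, to approve "
                  "the new budget for road repairs. Footer text.")
    print(overlap_ratio(article, corpus_doc))
```

A ratio near 1.0 suggests the article appears nearly verbatim in the candidate text, while a ratio near 0.0 suggests no long shared passages. Paraphrased or heavily edited copies would not be caught by exact hashes like these.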
Photo of Internet Archive servers (credit: Scott Beal / Laughing Squid)
Bottom line
The surge in robots.txt blocks reflects a growing tension between open‑access preservation and emerging AI economics. While the Internet Archive is taking steps to curb abuse, the loss of systematic, free web archiving for local news could create a blind spot in the historical record. A balanced solution will likely require legal clarity, technical standards for attribution, and modest public funding to keep local journalism accessible to future generations.
