The Filtered Web: LLMs as Opaque Gatekeepers Threatening Open Internet Access

During a packed breakout session at this year's TPAC, web advocates dissected emerging threats to the open web posed by large language models (LLMs). Traditional search crawlers have long benefited sites by funneling traffic and ad revenue to them. LLM crawlers, by contrast, are optimized for harvesting training data and answering queries; they drive up hosting costs (sometimes resembling a DDoS attack) while delivering content directly to users and bypassing the original sites. This traffic diversion imperils the ad-dependent models that sustain independent media and other vital web sectors, with ripple effects on societal functions such as informed discourse.

The Black Box of AI-Mediated Browsing

Beyond economics, the core concern is opacity: users increasingly 'browse' via LLMs and agents, receiving web content only after it has been reworded, altered, or commercialized by systems they cannot inspect. For simple queries like omelette recipes, distortions might be minor, diluted by voluminous training data. But higher-stakes information invites risks like model poisoning, where adversaries taint training datasets. Anthropic's recent research showed that as few as 250 malicious documents could implant a backdoor in LLMs ranging from 600M to 13B parameters, enabling persistent manipulation.

Key Threats in Detail

1. Content Distortion

LLMs can hallucinate or reproduce poisoned data, swapping ingredients in a recipe or, worse, fabricating facts in news articles or technical docs. Detection grows harder because training data remains proprietary.

2. Stealth Monetization

AI agents handling tasks like travel bookings could layer on their own margins, mirroring app store cuts or ride-hailing surcharges. With services like ChatGPT's $200 tier reportedly running at a loss, AI firms face acute pressure to monetize transactions opaquely.

3. Privacy Erosion

Natural-language interactions yield richer user profiles than keyword searches, supercharging surveillance already entrenched in tech giants' histories.

4. Ideological Injection

Most perturbing is the 'ideology dial' LLMs introduce. As Baldur Bjarnason notes in his post Poisoning for propaganda:

Placing an LLM in a process gives that process an “ideology dial” for whatever is produced with it. And, here's the kicker, that “dial” is controlled by (whoever runs) the organisation training the LLM.

Examples range from keyword censorship (Copilot dodging 'trans' in code) to subtler sentiment shifts. Deliberate company choices, whether filtering out racism or, per Elon Musk, tuning a model to be 'anti-woke', amplify this, as does external data poisoning for propaganda (e.g., Grok's biases). Because training corpora are unknown, independent audits are severely hindered.

Trust Deficit in AI Gatekeepers

Users must trust AI vendors on accuracy, fairness, pricing, and neutrality, a tall order given the precedents. Tech firms have wrapped commodities like taxis and meals in UIs to extract fat margins, while surveillance capitalism has thrived. Critiques like Karen Hao's Empire of AI and Timnit Gebru's TESCREAL analysis show how profit motives in AI labs are cloaked in altruistic rhetoric.

Developers face direct fallout: plummeting traffic demands new revenue streams (subscriptions, micropayments via Web Monetization) and crawler defenses (robots.txt tweaks, rate-limiting GPTBot); a sketch of both follows below. Yet these measures patch symptoms, not the root problem of mediation. W3C efforts, informed by experts like Hidde de Vries, underscore the need for transparent standards that preserve direct access.
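As a concrete starting point for the crawler defenses mentioned above, a robots.txt rule can ask known AI crawlers to stay away. The sketch below assumes OpenAI's published GPTBot user-agent token; other vendors document their own tokens, and compliance is entirely voluntary on the crawler's side.

```
# robots.txt (sketch): ask AI training crawlers not to fetch any pages.
# Advisory only; well-behaved crawlers honour it, others may not.
User-agent: GPTBot
Disallow: /

# Traditional search crawlers still send visitors back, so leave them alone.
User-agent: Googlebot
Allow: /
```

On the revenue side, the Web Monetization proposal is declared with a link element in the page head pointing at a wallet address. The payment pointer in this sketch is a placeholder, not a real endpoint:

```
<!-- Sketch: the href is a placeholder payment pointer, not a real wallet. -->
<link rel="monetization" href="https://wallet.example/my-site">
```

Because robots.txt is purely advisory, rate-limiting aggressive crawlers still requires server-level controls (firewall rules or reverse-proxy throttling) on top of it.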

In this LLM-filtered future, AI companies stand to capture data, margins, and narrative control. Sovereign, non-profit models offer hope, but skepticism lingers. The open web's strength, unmediated access to diverse sources, is teetering, challenging technologists to safeguard user agency amid AI's ascent.

This article is based on a blog post by Hidde de Vries, W3C AB member and web standards advocate (views his own).