When Defending Against AI Crawlers Means Locking Out Real People

Tech Essays Reporter

A blog owner's block notice doubles as a quiet report from the front line of a problem that has crept up on the small web: the cost of staying online is now partly paid in friction imposed on legitimate readers who happen to browse from the wrong IP address.

Chris Siebenmann, a system administrator at the University of Toronto who has run the long-lived blog Wandering Thoughts for nearly two decades, recently published something that is not really a blog post at all. It is an error page. If you reach it, you have been turned away. The explanation is plain: as of late 2025 he has started blocking mainstream browsers that connect from cloud, VPS, and server provider networks, because those address ranges have become the home of a flood of high-volume crawlers, many of them apparently harvesting text to train large language models. The page exists to catch the rare human who gets swept up in a filter aimed at machines.

The core argument embedded in that notice is uncomfortable, and it is worth taking seriously precisely because Siebenmann is not a reflexive opponent of automation. His position is that the behavior of the current generation of scrapers has shifted the economics of running a small site to the point where a blunt instrument is the only practical defense. The crawlers claim to be Firefox or Chrome. They rotate through enormous IP pools. They ignore the polite conventions (robots.txt directives, rate limits) that earlier generations of well-behaved bots respected. When a system cannot distinguish a legitimate request from an abusive one by inspecting the request itself, it falls back to inspecting where the request comes from. And crawlers run on servers, which by definition live in server networks.

Evidence that this is a widespread condition rather than one administrator's bad week has been accumulating across the independent web. Other operators have documented traffic patterns in which automated agents account for the overwhelming majority of requests, sometimes spiking by orders of magnitude over a period of months. Projects like Anubis, a proof-of-work gate designed specifically to make mass scraping expensive, and commercial offerings such as Cloudflare's AI crawler controls have emerged in direct response to the same pressure. The fact that a market for countermeasures now exists tells you the load is real and that the affected parties range from a personal wiki to infrastructure companies serving a meaningful slice of the internet.
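To make the idea of a proof-of-work gate concrete, here is a minimal sketch of a generic hash-based challenge. It is not Anubis's actual protocol; the function names, the `challenge:nonce` format, and the difficulty value are assumptions chosen for illustration. The point is the asymmetry: the client has to try many nonces, while the server verifies with a single hash.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def solve(challenge: str, difficulty_bits: int) -> int:
    """Client side: brute-force a nonce whose hash meets the difficulty."""
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= difficulty_bits:
            return nonce


def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Server side: one hash is enough to check the client's work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty_bits


if __name__ == "__main__":
    # Hypothetical challenge string; a real gate would issue a fresh one per visitor.
    nonce = solve("example-challenge", 12)
    print(nonce, verify("example-challenge", nonce, 12))
```

Each additional bit of difficulty roughly doubles the expected work for the client, which a single reader barely notices but a crawler fetching millions of pages has to pay millions of times.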

What makes Siebenmann's notice instructive is less the defense he has chosen than the implication he is honest about. Blocking by network origin is a heuristic that trades precision for tractability. It works because almost nobody browses from a data center, and it fails for the same reason: the people who do browse from a VPS or a cloud IP, whether they are running a personal proxy, working from a corporate egress gateway, using a privacy service, or simply living somewhere their consumer ISP routes oddly, are now collateral. The block page is an admission that the filter has a known false-positive rate, and the offered remedy, emailing the operator with your IP and exact User-Agent string so you can be allowlisted by hand, is a return to a pre-industrial mode of access control. One human vouches for another, one address at a time.
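The shape of such a filter can be sketched in a few lines. This is an illustration under stated assumptions, not Siebenmann's actual configuration: the network ranges are reserved documentation prefixes standing in for cloud provider blocks, the allowlist entry is invented, and the browser check is a deliberately naive substring match. Requests that do not claim to be a browser are simply passed through here, since the notice concerns browser-claiming clients.

```python
import ipaddress

# Hypothetical stand-ins for cloud/VPS provider ranges (documentation prefixes).
BLOCKED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

# Hand-maintained exceptions: exact (IP, User-Agent) pairs vouched for by email.
ALLOWLIST = {
    ("203.0.113.7",
     "Mozilla/5.0 (X11; Linux x86_64; rv:140.0) Gecko/20100101 Firefox/140.0"),
}

# Naive markers for "claims to be a mainstream browser" (an assumption).
BROWSER_MARKERS = ("Firefox/", "Chrome/", "Safari/")


def claims_mainstream_browser(user_agent: str) -> bool:
    return any(marker in user_agent for marker in BROWSER_MARKERS)


def should_block(client_ip: str, user_agent: str) -> bool:
    """Block browser-claiming requests from listed networks unless allowlisted."""
    if (client_ip, user_agent) in ALLOWLIST:
        return False
    if not claims_mainstream_browser(user_agent):
        return False  # out of scope for this sketch
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in BLOCKED_NETWORKS)
```

Note what the function never looks at: anything about the reader's intent. A human on a cloud IP and a crawler on the same range are indistinguishable to it, which is exactly the false-positive problem the block page concedes.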

There is a deeper pattern here that connects this small episode to the architecture of the open web. For most of its history, the web's accessibility rested on an implicit bargain: requests were cheap to serve because the population of clients, while large, was bounded by the number of actual humans and a manageable set of identifiable bots. Generative AI broke the assumption that demand for a page is roughly proportional to human interest in it. A single obscure post might be fetched thousands of times not because thousands of people want to read it, but because it is one more token source in a training corpus. When the cost of being read decouples from the value of being read, the party bearing the cost starts looking for a way to opt out, and the only levers available to a one-person operation are crude.

The counter-perspective deserves a fair hearing. From the vantage point of the crawler operators, the open web has always been, well, open, and indexing it is what made it navigable in the first place. Search engines crawled aggressively for years and we came to regard that as a public good, even a precondition for a site mattering at all. One could argue that LLM training is continuous with that tradition and that the real failure is technical: the absence of a widely adopted, enforceable signal for consent and rate, something more expressive than robots.txt and more reliable than a User-Agent header that anyone can forge. Efforts to define such mechanisms are underway, but standards move slowly and the scraping is happening now. There is also a fairness question that cuts the other way. Blocking entire network ranges punishes the many legitimate cloud users for the behavior of a few abusive operators, and it quietly assumes that real readers should browse from residential connections, a normative claim about how people ought to use the internet that the internet's own design never endorsed.

What lingers after reading the notice is the asymmetry it reveals. The entities generating the load are large, well-funded, and capable of routing around obstacles. The entity bearing the load is a single person who would rather be writing about Unix and system administration than maintaining a blocklist. The defense he has reached for is the one available to someone without a budget for traffic analysis or a contract with a CDN, and its visible cost is borne by individual readers who get a wall instead of a page. That distribution of burden, abundant resources on the extracting side and improvised friction on the publishing and reading sides, is the actual story. The block page is a small artifact of a larger renegotiation over who gets to read the web at scale, on whose terms, and at whose expense, and it is being settled for now not by policy or protocol but by thousands of individual operators deciding, one IP range at a time, that the open door has become too expensive to leave open.
