Babeling the Bots: Turning Scraper Attacks into a Game of Cat and Mouse
The Problem: Scrapers as Silent DDoS
Web scrapers, especially those that blindly chase every link, can unintentionally flood a site with traffic that looks like a distributed denial‑of‑service attack. Small blogs and niche services have started asking for guidance on how to protect themselves from this unintentional abuse.
The author’s earlier posts focused on the defensive side—blocking bad requests with a 403. In this installment, he flips the script and explores fighting back.
"The real enemy are the bots that scrape with malicious intent," he writes. "I get hundreds of thousands of requests for .env, .aws, and all the different .php paths that could signal a misconfigured WordPress instance."
From Theory to Rust: Building a Markov‑Chain Babbler
The idea came from a paper about feeding scrapers endless streams of junk data with a Markov‑chain babbler. The author dove into the mathematics, learned Rust, and built a generator that can be trained on any corpus.
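The post doesn't include the generator's source, but the core of a word-level Markov chain is small. A minimal sketch in Rust, assuming a first-order chain and the rand 0.8 crate (the names and structure are illustrative, not the author's actual code):

use rand::seq::SliceRandom;
use std::collections::HashMap;

/// Word-level Markov chain of order 1: maps each word to the words
/// observed to follow it in the training corpus.
struct Babbler {
    transitions: HashMap<String, Vec<String>>,
}

impl Babbler {
    /// Train by recording every (word, next word) pair in the corpus.
    fn train(corpus: &str) -> Self {
        let words: Vec<&str> = corpus.split_whitespace().collect();
        let mut transitions: HashMap<String, Vec<String>> = HashMap::new();
        for pair in words.windows(2) {
            transitions
                .entry(pair[0].to_string())
                .or_default()
                .push(pair[1].to_string());
        }
        Babbler { transitions }
    }

    /// Generate up to `n` words by walking the chain from `start`,
    /// picking a random observed successor at each step.
    fn generate(&self, start: &str, n: usize) -> String {
        let mut rng = rand::thread_rng();
        let mut current = start.to_string();
        let mut out = vec![current.clone()];
        for _ in 0..n {
            match self.transitions.get(&current).and_then(|next| next.choose(&mut rng)) {
                Some(word) => {
                    out.push(word.clone());
                    current = word.clone();
                }
                None => break, // dead end: this word never appears mid-corpus
            }
        }
        out.join(" ")
    }
}

Trained on real PHP, a chain like this picks up enough vocabulary and docblock structure to produce output like the sample below.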
"I trained my Markov chain on a few hundred .php files, and set it to generate. The responses certainly look like PHP at a glance, but on closer inspection they're obviously fake," he notes.
A sample 1 KB output looks like this:
<?php
wp_list_bookmarks() directly, use the Settings API. Use this method directly. Instead, use `unzip_file() {
return substr($ delete, then click “ %3 $ s object. ' ), ' $ image
*
*
*
* matches all IMG elements directly inside a settings error to the given context.
* @return array Updated sidebars widgets.
* @param string $ name = "rules" id = "wp-signup-generic-error" > ' . $errmsg_generic . ' </p> ';
}
The goal was twofold: waste the bot’s resources with large, fake files and lure the human operator into a time‑consuming rabbit hole.
The Efficiency Battle: Static Content to the Rescue
Serving 1 MB files from a VPS pushed response times into the hundreds of milliseconds, stressing the server. The author realized the most efficient way to serve data is as static content.
He built a lightweight server that loads the full text of Frankenstein into memory, then serves random paragraphs on each request. Each “post” links to five others, creating a breadth‑first crawl that quickly saturates bots.
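The post shows the result rather than the server code, but the idea can be sketched with the standard library plus rand: load the corpus once at startup, then answer every request with a random paragraph and five fresh links. The file name, port, and link scheme below are assumptions for illustration, not the author's implementation:

use rand::Rng;
use std::io::{Read, Write};
use std::net::TcpListener;

fn main() -> std::io::Result<()> {
    // Load the corpus once; everything served afterwards comes from memory.
    let corpus = std::fs::read_to_string("frankenstein.txt")?;
    let paragraphs: Vec<&str> = corpus
        .split("\n\n")
        .filter(|p| !p.trim().is_empty())
        .collect();

    let listener = TcpListener::bind("127.0.0.1:8080")?;
    for stream in listener.incoming() {
        let mut stream = stream?;
        // Read and discard the request; every path gets the same treatment.
        let mut buf = [0u8; 1024];
        let _ = stream.read(&mut buf);

        let mut rng = rand::thread_rng();
        let paragraph = paragraphs[rng.gen_range(0..paragraphs.len())];
        // Five random "post" links so a breadth-first crawler fans out forever.
        let links: String = (0..5)
            .map(|_| format!("<a href=\"/post/{}\">more</a> ", rng.gen_range(0..1_000_000)))
            .collect();
        let body = format!("<html><body><p>{}</p>{}</body></html>", paragraph, links);
        let response = format!(
            "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\nContent-Length: {}\r\n\r\n{}",
            body.len(),
            body
        );
        let _ = stream.write_all(response.as_bytes());
    }
    Ok(())
}

Because the corpus sits in memory and responses are assembled from pre-split paragraphs, each request is essentially a string copy rather than a disk read, which is what keeps latency low.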
"You can see it in action here: https://herm.app/babbler/" – the site is a playground, not a production deployment.
The same technique was applied to PHP files, with a static server at https://herm.app/babbler.php.
Risks and Trade‑offs
The author warns that this approach can backfire. Even with robots.txt, nofollow, and noindex, search engines might crawl the fake endpoints and flag the site as spam. This could hurt Google rankings and trigger browser warnings.
"If you or your website depend on being indexed by Google, this may not be viable," he cautions.
For projects that are not heavily indexed, the babbler can be a playful deterrent against malicious scrapers.
The Hidden Hook
As a compromise, the author added a hidden link on his blog to lure bad bots:
<a href="https://herm.app/babbler/" rel="nofollow" style="display:none">Don't follow this link</a>
This subtle bait keeps the site friendly to legitimate crawlers while frustrating the rest.
Final Thoughts
The experiment showcases how a deep understanding of bot behavior, combined with algorithmic creativity, can turn a defensive problem into a playful one. It also reminds us that every countermeasure carries trade‑offs—especially when search engines are involved.
"Not all threads need to lead somewhere pertinent. Sometimes we can just do things for fun," the author concludes, hinting that the line between utility and entertainment is thinner than it seems in the web‑dev world.
Source: https://herman.bearblog.dev/messing-with-bots/