Miasma: A tool to trap AI web scrapers in an endless poison pit

AI companies scrape the web at massive scale to train models like GPT-4, but site owners pay the bandwidth bill and lose control over their data. Enter Miasma: a Rust-based tool that detects scrapers and traps them in an infinite loop of generated nonsense, wasting their compute and poisoning their datasets. Shared on Hacker News last week, it fingerprints bots by request rate, user agent, and behavior, then serves endless “poison” pages. This isn’t just defense; it’s retaliation.

Web scraping exploded with AI’s hunger for data. Cloudflare’s 2023 report pegs scrapers at 30-50% of some sites’ traffic, up from 10% pre-LLM boom. OpenAI, Anthropic, and others hoover billions of pages without permission, ignoring robots.txt in 80% of cases per studies from Imperva. Site owners face real costs: a mid-sized blog might burn $500/month on extra server load. Worse, scraped data fuels models that regurgitate copyrighted content, sparking lawsuits like The New York Times vs. OpenAI in December 2023, alleging $100M+ in damages.

The Scraping Problem in Numbers

Quantify the assault: Bright Data tracks 1.2 billion scraping requests daily across top sites. AI bots mimic humans poorly: headless Chrome leaves fingerprints such as missing WebGL support or uniform request intervals. Traditional defenses fail. CAPTCHAs annoy users, rate limiting starves legitimate traffic, and legal threats move slowly. robots.txt? A gentleman’s agreement that bots laugh at; Common Crawl indexes 3 petabytes yearly, much of it from disallowed paths.
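One of the fingerprints mentioned above, uniform request intervals, is simple to sketch. This is a minimal illustration, not Miasma's actual code: the function names and the 0.1 threshold are assumptions picked for the example. It measures how much the gaps between a client's requests vary relative to their average.

```python
from statistics import mean, pstdev
from typing import Optional, Sequence


def interval_cv(timestamps: Sequence[float]) -> Optional[float]:
    """Coefficient of variation of the gaps between consecutive requests.

    Returns None when there are too few requests to judge.
    """
    if len(timestamps) < 3:
        return None
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    average_gap = mean(gaps)
    if average_gap == 0:
        return 0.0  # all requests at the same instant: perfectly uniform
    return pstdev(gaps) / average_gap


def looks_machine_timed(timestamps: Sequence[float], cv_threshold: float = 0.1) -> bool:
    """Flag clients whose request spacing is suspiciously regular.

    cv_threshold is a hypothetical cutoff, not a value from Miasma.
    """
    cv = interval_cv(timestamps)
    return cv is not None and cv < cv_threshold
```

A client requesting a page exactly every two seconds scores a variation of zero and gets flagged, while irregular human browsing does not. Real scrapers can add jitter, so a signal like this only works alongside others.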

This matters because scraping undermines the open web. Publishers can’t monetize without walls, yet walls kill SEO. AI firms train on your content for free, then compete with you via tools like ChatGPT plugins. Small devs and indie sites suffer most—no lawyers for DMCA takedowns.

How Miasma Traps and Poisons

Miasma runs as a reverse proxy or standalone server. It watches for red flags: high request rates (e.g., 100/min from one IP), suspicious user agents like “Mozilla/5.0 (compatible; Googlebot/2.1)”, or failed JavaScript checks. When a client looks suspect, Miasma redirects it to a tarpit.
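The two simplest red flags, request rate and user agent, can be sketched as follows. This is an illustration under stated assumptions, not Miasma's implementation: the class name, the "serve" verdict, and the sliding-window approach are all hypothetical. Only the 100/min figure and the GPTBot user agent come from the article.

```python
from collections import defaultdict, deque


class ScraperDetector:
    """Hypothetical detector combining a user-agent list with a per-IP rate check."""

    def __init__(self, blocked_agents=("GPTBot",), max_per_window=100, window_seconds=60.0):
        self.blocked_agents = tuple(agent.lower() for agent in blocked_agents)
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def classify(self, ip: str, user_agent: str, now: float) -> str:
        """Return "tarpit" for suspected scrapers, "serve" otherwise."""
        lowered = user_agent.lower()
        if any(agent in lowered for agent in self.blocked_agents):
            return "tarpit"

        recent = self.hits[ip]
        recent.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) > self.max_per_window:
            return "tarpit"
        return "serve"
```

A sliding window is used here because it avoids the burst-at-the-boundary problem of fixed one-minute buckets. Behavioral checks such as JavaScript challenges are left out of the sketch.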

The tarpit generates infinite content on the fly. Fetch /article/123, get 10KB of SEO-stuffed gibberish about “quantum blockchain synergies.” Click “next page”? Another 10KB variant. Bots chase links forever, burning cycles. Author Nils Adermann tuned it to mimic real sites: dynamic titles, meta tags, even fake author bios. Rust keeps overhead low: it handles 10k req/s on a $5 VPS.
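The "infinite content" trick needs no storage if each page is derived from its URL. The sketch below is an assumption about how such a tarpit could work, not a description of Miasma's generator: the word list, function name, and link format are made up for illustration. Each path seeds a random generator, so the same URL always yields the same page, and every page links to more pages that do not exist until they are requested.

```python
import hashlib
import random

# Hypothetical vocabulary; a real tarpit would use something far larger.
WORDS = [
    "quantum", "blockchain", "synergy", "paradigm", "scalable", "neural",
    "disruptive", "protocol", "leverage", "ecosystem", "tokenized", "fusion",
]


def poison_page(path: str, size: int = 10_000, links: int = 3) -> str:
    """Build a deterministic page of filler text with links to further pages."""
    # Seed from the path so repeated fetches of one URL return identical HTML.
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))

    title = " ".join(rng.choice(WORDS) for _ in range(4)).title()

    words = []
    length = 0
    while length < size:  # keep adding words until the body reaches ~size bytes
        word = rng.choice(WORDS)
        words.append(word)
        length += len(word) + 1
    body = " ".join(words)

    anchors = "".join(
        f'<a href="/article/{rng.getrandbits(32)}">next page</a>' for _ in range(links)
    )
    return (
        f"<html><head><title>{title}</title></head>"
        f"<body><p>{body}</p>{anchors}</body></html>"
    )
```

Calling `poison_page("/article/123")` returns roughly 10KB of HTML with three outgoing links, and following any of them produces another page the same way.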

Setup is straightforward. Clone from GitHub, tweak config:

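[miasma]
# Address and port the server listens on
bind = "0.0.0.0:8080"
# Directory for poison content (purpose inferred from the key name)
poison_dir = "/var/www/poison"

# Tarpit clients whose user agent matches GPTBot
[[rules]]
user_agent = "GPTBot"
action = "tarpit"

# Tarpit clients that exceed 50 requests per minute
[[rules]]
requests_per_min = 50
action = "tarpit"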

Deploy behind Nginx or Caddy. In tests, each trapped bot pulls 1-2GB of generated pages per hour, well beyond what normal page serving would hand it.
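For a sense of scale, the 1-2GB/hour figure can be converted into a request rate using the 10KB page size quoted earlier. The calculation below is back-of-the-envelope: it assumes decimal units (1GB = 1,000,000,000 bytes), pages of exactly 10KB, and no header or compression overhead.

```python
def pages_per_second(gb_per_hour: float, page_kb: float = 10.0) -> float:
    """Convert hourly transfer volume into an average page rate."""
    bytes_per_hour = gb_per_hour * 1_000_000_000
    pages_per_hour = bytes_per_hour / (page_kb * 1_000)
    return pages_per_hour / 3600


print(round(pages_per_second(1), 1))  # 27.8
print(round(pages_per_second(2), 1))  # 55.6
```

So a single trapped bot would be fetching somewhere around 28 to 56 poison pages every second. That traffic also leaves the host's network, so the bandwidth is not free for the site owner either.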

Implications: Arms Race or Web Savior?

Miasma levels the field. Indie sites deploy it for free, retaliating against trillion-dollar labs. Poison data degrades models: feed GPT enough “Florida Man invents NFT-powered fusion reactor” drivel, and outputs get noisier. Early adopters on HN report a 40% traffic drop post-install, with scrapers vanishing after hitting the tarpits.

Skeptical take: It escalates the arms race. Sophisticated scrapers adapt with polite distributed crawls, VPN rotation, and better browser emulation. Perplexity.ai already uses human-like fingerprints. Legal gray area too: serving infinite data might skirt abuse laws, but it could invite DDoS claims if bots get overwhelmed. Measure success? Track bandwidth savings and model hallucination spikes, but causality is fuzzy.

Why this matters: The web’s sustainability hangs on fair data access. Without tools like Miasma, AI eats the ecosystem alive. Pair it with Cloudflare Workers for scale, or o1 for custom fingerprints. Deploy now if you’re scraped—it’s open-source revenge.

March 30, 2026 · 3 min · 12 views · Source: Hacker News
