Running out of Disk Space in Production

A developer launched a simple download server for 2.2GB digital files on a Hetzner box with just 40GB disk and 4GB RAM running NixOS. Minutes after announcing availability, hundreds of customers hit the site. Disk filled to 100% instantly. Email deliveries failed with “452 4.3.1 Insufficient system storage.” Users couldn’t download files. Grafana and df -h confirmed /dev/sda at capacity. This exposed classic production pitfalls: analytics bloat, Nix store explosion, and overlooked log growth under load.

The Barebones Setup

The server ran a Haskell program serving static files behind authorization checks, fronted by nginx as a reverse proxy for a virtual host. The box was one of Hetzner’s small cloud plans, costing a few euros a month for a couple of vCPUs at most and 40GB of local SSD. Fine for low-traffic prototypes, but tight for bursts. NixOS managed the stack declaratively, including Plausible Analytics for privacy-focused tracking. Plausible relies on ClickHouse, a columnar database whose data and internal log tables both consume disk.

No cloud storage for the 2.2GB files. Everything sat local. Smart for control and cost, but risky. One file equals 5.5% of total disk. Hundreds downloading simultaneously? Cache misses and temp files pile up fast.
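The arithmetic above can be sketched in a few lines of Python. This is an illustration only: the 16GB of free space is a hypothetical figure, and the function names are made up for this example. The source does not confirm that full temporary copies were actually written to disk.

```python
import math

def share_of_disk(file_gb: float, disk_gb: float) -> float:
    """Fraction of the whole disk taken by one copy of the file."""
    return file_gb / disk_gb

def copies_that_fit(free_gb: float, file_gb: float) -> int:
    """How many complete extra copies fit into the remaining free space."""
    return math.floor(free_gb / file_gb)

# 2.2GB file on a 40GB disk, as in the article.
print(f"{share_of_disk(2.2, 40):.1%} of the disk per copy")

# Hypothetical: if 16GB were free and each in-flight download
# spilled a full temporary copy to disk, this many would fit.
print(copies_that_fit(16, 2.2), "temporary copies before the disk is full")
```

Under those assumptions only a handful of concurrent downloads would be enough to exhaust the disk, which is why a burst of hundreds of customers is dangerous on a box this size.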

Traffic Spike Triggers Cascade

Logs flooded: Mar 31 20:43:03 mogbit kanjideck-fulfillment[2528300]: user error (Unexpected reply to: MAIL "...@kanjideck.com", Expected reply code: 250, Got this instead: 452 "4.3.1 Insufficient system storage"). SMTP rejected outbound mail. Incoming complaints likely dropped too. Disk at 40GB/40GB.

du -sh scans revealed the culprits: Plausible’s ClickHouse database under /var/lib at 8.5GB, and /nix/store at 15GB with generations of configs and binaries. That leaves about 16GB unaccounted for; the author later realized that the “rest of files” couldn’t explain it. Likely suspects: journald logs exploding from request volume, nginx access logs, Haskell temp files, or ClickHouse write-ahead logs.
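The accounting above can be reproduced with a small script. This is a minimal sketch, not the author's tooling: dir_size_bytes is a rough stand-in for du (it sums apparent file sizes and does not deduplicate hard links, so it can over-count a hard-link-optimized /nix/store), and unaccounted_gb simply subtracts the known culprits from the disk size.

```python
import os

def dir_size_bytes(root: str) -> int:
    """Sum apparent file sizes under root (hard links are counted once per name)."""
    total = 0
    # onerror swallows unreadable directories instead of aborting the walk
    for dirpath, _dirnames, filenames in os.walk(root, onerror=lambda err: None):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat: measure a symlink itself, not the file it points to
                total += os.lstat(path).st_size
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total

def unaccounted_gb(disk_gb: float, known_gb: dict) -> float:
    """Disk size minus everything already attributed to a known directory."""
    return disk_gb - sum(known_gb.values())

# Figures from the article: 40GB disk, 8.5GB ClickHouse, 15GB /nix/store.
print(unaccounted_gb(40, {"clickhouse": 8.5, "nix_store": 15}))  # 16.5
```

Running dir_size_bytes over candidates such as /var/log or nginx's temp directory is one way to chase down the remaining gap.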

Panic mode. The first fix attempt was to nuke old profiles and unreferenced store paths:

nix-collect-garbage -d

It failed: error: opening lock file '/nix/var/nix/profiles/system.lock': No space left on device. Nix needs breathing room to run GC, which is ironic for a reproducibility tool.

Desperate Space Hunt

Quick win: vacuum the systemd journal.

journalctl --vacuum-time=1s

That freed enough space for Nix GC to proceed, reclaiming 15GB. Next: shrink ClickHouse, which records every query it runs in system.query_log. Attempt:

clickhouse-client -q "TRUNCATE TABLE system.query_log"

Failed: Code: 243. DB::Exception: Cannot reserve 1.00 MiB, not enough space. ClickHouse buffers aggressively; even TRUNCATE needs temp space.

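The real fix likely involved stopping services and removing large directories with rm -rf, or clearing ClickHouse’s system log tables once a little space was free, then restarting analytics or offloading it to an external host. For the files, migrate to S3-compatible object storage such as Backblaze B2 (pennies per GB), or to a Hetzner Storage Box (€3/month for 100GB), which is reached over protocols like SFTP or WebDAV rather than an S3 API.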

Why This Bites—and How to Avoid

Small instances tempt startups: low cost, quick spin-up. But a 40GB disk leaves only about 20GB of safe headroom after OS and Nix overhead. Plausible’s ClickHouse grows 1-5GB/month on modest traffic; spikes multiply it. /nix/store easily hits 10-20GB with iterations. Add logs: game over.
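To put the headroom figure in perspective, here is a back-of-the-envelope projection using only the numbers quoted above (about 20GB of headroom, 1-5GB/month of ClickHouse growth). It assumes linear growth and ignores logs and the Nix store, so treat it as an optimistic upper bound; the function name is made up for this sketch.

```python
def months_until_full(headroom_gb: float, growth_gb_per_month: float) -> float:
    """Months before steady growth consumes the available headroom."""
    if growth_gb_per_month <= 0:
        raise ValueError("growth rate must be positive")
    return headroom_gb / growth_gb_per_month

# The article's range: 1GB/month (quiet) to 5GB/month (busy).
for rate in (1, 5):
    print(f"{rate}GB/month -> {months_until_full(20, rate):.0f} months")
```

At the high end of that range the headroom is gone in four months without any traffic spike at all.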

Implications hit hard for indie devs. Downtime erodes trust: KanjiDeck customers waited months for files, then 503s. Revenue risk if paid. Scale math: 100 users x 2.2GB = 220GB theoretical transfer; reality needs bandwidth too (Hetzner’s small cloud plans include roughly 20TB/month of outbound traffic).
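The scale math can be extended a little. The sketch below reuses the article's figures (100 users, 2.2GB, 20TB/month) and adds one assumption that is not in the source: a 1 Gbit/s uplink that is fully saturated. Units are decimal (1TB = 1000GB), and the function names are hypothetical.

```python
def total_transfer_gb(users: int, file_gb: float) -> float:
    """Total data served if every user downloads the file once."""
    return users * file_gb

def transfer_minutes(total_gb: float, link_gbit_per_s: float) -> float:
    """Best-case time to push total_gb through a saturated link."""
    # GB -> gigabits (x8), divide by link rate for seconds, then convert to minutes
    return total_gb * 8 / link_gbit_per_s / 60

def share_of_quota(total_gb: float, quota_tb: float) -> float:
    """Fraction of a monthly traffic allowance consumed."""
    return total_gb / (quota_tb * 1000)

total = total_transfer_gb(100, 2.2)
print(f"{total:.0f} GB total")                                 # 220 GB total
print(f"{transfer_minutes(total, 1.0):.1f} min at 1 Gbit/s")   # 29.3 min at 1 Gbit/s
print(f"{share_of_quota(total, 20):.1%} of a 20 TB allowance") # 1.1% of a 20 TB allowance
```

On these numbers the monthly traffic allowance is nowhere near the bottleneck; instantaneous bandwidth and local disk are.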

Fixes: alert at 80% disk usage via Prometheus/Grafana (they had Grafana, so use it). Separate concerns: move static files to a CDN or object store. Run analytics on a beefier box or use the hosted version (Plausible Cloud starts €5/month). NixOS trick: set nix.gc.automatic = true with nix.gc.options = "--delete-older-than 7d" in the config, so old generations are garbage-collected on a schedule instead of by hand during an outage. Or ditch Nix for Docker if purity isn’t critical.
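The 80% alert does not need a full monitoring stack to prototype. The following is a minimal sketch using only Python's standard library; it stands in for a Prometheus alert rule rather than reproducing one, and the threshold default and function names are choices made for this example. Note that the fraction computed here can differ slightly from the Use% column of df, which accounts for filesystem-reserved blocks.

```python
import shutil
import sys

def disk_used_fraction(path: str = "/") -> float:
    """Used space divided by total space for the filesystem containing path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def should_alert(used_fraction: float, threshold: float = 0.80) -> bool:
    """True once usage reaches the threshold (80% by default)."""
    return used_fraction >= threshold

if __name__ == "__main__":
    fraction = disk_used_fraction("/")
    if should_alert(fraction):
        # Write to stderr so cron or a systemd timer surfaces the warning
        print(f"WARNING: disk {fraction:.0%} full", file=sys.stderr)
    else:
        print(f"ok: disk {fraction:.0%} full")
```

Run from a cron job or systemd timer, a check like this would have fired well before SMTP started returning 452.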

Skeptical take: tools like NixOS and Plausible prioritize ideals (reproducibility, privacy) over pragmatism. Great for side projects; production demands ruthless optimization. Monitor everything. Size disks for peaks, not averages. One launch-day full disk, and your MVP is DOA.
