
Building the foundation for running extra-large language models

Cloudflare engineered a custom stack to run extra-large language models like Moonshot's Kimi K2.5 three times faster on their Workers AI platform.

The platform prioritizes agentic workloads, where context balloons with each user interaction: system prompts, tool definitions, prior messages, and code all accumulate. The stack processes massive input-token counts quickly while keeping tool calls fast. The result is high-performance inference at the edge, without users managing GPUs.

Why does this matter? Running extra-large language models (XLMs) demands balancing compute-hungry prefill (processing inputs into the KV cache) against memory-bound decode (generating outputs). Mismatch them on one machine, and GPUs sit idle. Cloudflare disaggregates these stages across specialized hardware, squeezing efficiency from expensive silicon. In a market where Nvidia H100s cost $30,000+ each, such optimizations cut bills and scale globally on their 300+ data center network.

Hardware Tailored to Workloads

Cloudflare deploys varied GPU configs based on token patterns. Fanfiction generation? Short inputs, long outputs: favor decode speed. Summarization? Thousands of input tokens, brief outputs: prioritize prefill. Agents dominate Workers AI: a single request packs system prompts (10k+ tokens), tool definitions, Model Context Protocol (MCP) servers, and conversation history. Context can grow 2-5x per turn, hitting 100k-500k tokens fast.
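The context-growth math above compounds quickly. A minimal sketch, assuming the conservative end of the stated range (2x growth per turn) and a 10k-token starting context; the function and numbers are illustrative, not Cloudflare's:

```python
def turns_to_reach(start_tokens: int, growth: float, limit: int) -> int:
    """Count conversation turns until the agent's context reaches `limit`
    tokens, assuming the context multiplies by `growth` each turn."""
    tokens, turns = start_tokens, 0
    while tokens < limit:
        tokens = int(tokens * growth)
        turns += 1
    return turns

# At 2x growth, a 10k-token context passes 100k in just four turns:
# 20k -> 40k -> 80k -> 160k.
print(turns_to_reach(10_000, 2.0, 100_000))  # -> 4
```

At the 5x end of the range, the same limit falls in two turns, which is why prefill throughput dominates agentic serving.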

They use Nvidia H100s and A100s, but the mix shifts with the workload: compute-bound prefill leans on high-FLOPS setups, while memory-bound decode needs high memory bandwidth (H100s peak at 3.35 TB/s of HBM3). Single-machine runs can waste 50-70% of GPU cycles, per industry benchmarks from vLLM and TensorRT-LLM. Cloudflare's edge deployment adds a latency win: queries reach models in under 50 ms round-trip worldwide, versus 200 ms+ for centralized clouds.
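Why decode is bandwidth-bound is easy to see with a roofline estimate: at batch size 1, every generated token must stream the model's active weights from HBM once, so bandwidth divided by bytes-read-per-token caps tokens/second. A sketch with illustrative numbers (a 70B-parameter dense model at 8-bit weights; not Cloudflare's or Kimi's actual configuration):

```python
def decode_roofline_tps(model_bytes_gb: float, hbm_bandwidth_gbs: float) -> float:
    """Batch-1 decode upper bound: each new token streams every active
    weight from HBM once, so tokens/s <= bandwidth / bytes-per-token."""
    return hbm_bandwidth_gbs / model_bytes_gb

# ~70 GB of 8-bit weights on one H100 with 3,350 GB/s of HBM3 bandwidth.
print(round(decode_roofline_tps(70, 3_350), 1))  # -> 47.9
```

Note how close this ceiling sits to the ~50 tokens/second figure quoted later: decode speed is a bandwidth story, not a FLOPS story.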

Prefill-Decode Disaggregation

Core innovation: split prefill and decode across machines. Prefill computes attention over the inputs and builds the KV cache (the key-value states reused during autoregressive generation). Decode then streams tokens sequentially, fetching from that cache. On a single GPU, prefill hogs compute while decode starves on memory bandwidth; tune for the reverse and compute sits idle.
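The KV cache that gets handed between the two stages is large. A sketch of the standard size formula (keys plus values, for every layer, KV head, and position), using an illustrative 70B-class config with grouped-query attention; none of these dimensions come from the article:

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, dtype_bytes: int = 2) -> int:
    """KV cache size: 2 (K and V) x layers x KV heads x head_dim x
    bytes-per-element, per cached token position."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

# Illustrative config: 80 layers, 8 KV heads (GQA), head_dim 128, fp16.
gb = kv_cache_bytes(100_000, 80, 8, 128) / 1e9
print(f"{gb:.1f} GB")  # -> 32.8 GB
```

Tens of gigabytes per 100k-token agent session is exactly the payload the disaggregated design must ship between prefill and decode nodes, which is why the 4-bit quantization mentioned below matters.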

Disaggregation routes prefill to compute-optimized nodes (e.g., A100 clusters), then ships compressed KV cache to memory-optimized decode nodes (H100s). Network overhead? Cloudflare’s fabric handles it in microseconds via RDMA-like tech. Benchmarks show 2-4x throughput gains for long-context agents, aligning with papers from DeepMind and MosaicML.
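The split can be sketched as two toy functions; this is a hypothetical shape of the pipeline, not Cloudflare's runtime, and the hash/sum stand-ins are placeholders for real K/V tensors and sampling:

```python
def prefill(prompt_tokens: list) -> list:
    """Compute-bound stage: process the whole prompt in parallel and
    materialize one KV entry per input position (toy stand-in)."""
    return [hash(t) for t in prompt_tokens]

def decode(kv_cache: list, max_new: int) -> list:
    """Memory-bound stage: emit tokens one at a time, each step reading
    the full cache and appending its own KV entry."""
    out = []
    for _ in range(max_new):
        tok = sum(kv_cache) % 50_000   # stand-in for sampling a token
        out.append(tok)
        kv_cache.append(hash(tok))
    return out

# Disaggregated: prefill runs on a compute-optimized node, and the cache
# is shipped (here, just passed) to a memory-optimized decode node.
cache = prefill(["system", "tools", "history"])
print(len(decode(cache, 4)))  # -> 4
```

The key property the sketch shows: prefill touches each input once and is embarrassingly parallel, while decode is a serial loop over an ever-growing cache, so the two stages want different hardware.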

The implementation skips standard frameworks like Hugging Face Transformers. They built custom runtimes in Rust and WebAssembly for Workers, integrating with their V8 isolates. KV cache quantization (4-bit) shrinks transfer sizes by 75%, vital at edge scale. Early Kimi K2.5 runs hit 50 tokens/second decode, up from 15-20 on stock setups, verifiable via their playground.
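The 75% figure follows directly from the bit widths: fp16 is 16 bits per value, 4-bit codes pack two values per byte. A minimal sketch of naive symmetric packing (the scale handling and code mapping are illustrative, not Cloudflare's actual quantization scheme):

```python
def quantize_4bit(values: list, scale: float) -> bytes:
    """Map floats to 4-bit codes (0..15 after a +8 offset) and pack two
    codes per byte: a 75% size cut versus fp16 storage."""
    codes = [max(0, min(15, round(v / scale) + 8)) for v in values]
    return bytes((codes[i] << 4) | codes[i + 1]
                 for i in range(0, len(codes), 2))

fp16_bytes = 1024 * 2                       # 1,024 fp16 KV elements
packed = quantize_4bit([0.1] * 1024, scale=0.05)
print(len(packed) / fp16_bytes)             # -> 0.25
```

Quartering the cache payload matters most on the wire: the compressed cache is what crosses the fabric from prefill to decode nodes.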

Real-World Trade-offs and Skepticism

Gains aren’t free. Disaggregation adds orchestration complexity: cache versioning, failover, and load balancing across 300+ cities. Peak loads spike costs; a model of Kimi’s class idles at $5-10/hour per GPU. Cloudflare claims “serverless” pricing ($0.001-0.01 per 1k tokens), but agents with 100k-token contexts burn $0.10-1.00 per query. Scale that to millions of queries and the bills mount.
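The per-query range quoted above is simple per-token arithmetic; a sketch using only the article's own numbers:

```python
def query_cost(context_tokens: int, price_per_1k: float) -> float:
    """Per-query input cost under flat per-token pricing (illustrative)."""
    return context_tokens / 1_000 * price_per_1k

# A 100k-token agent context at the stated $0.001-$0.01 per 1k tokens.
low = query_cost(100_000, 0.001)
high = query_cost(100_000, 0.01)
print(f"${low:.2f} to ${high:.2f} per query")  # -> $0.10 to $1.00 per query
```

At a million such queries a day, even the low end is $100k/day, which is the scaling concern the paragraph raises.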

Fair props: the open-source focus (Llama 3, with Mixtral next) beats proprietary lock-in, and edge inference crushes latency for RAG agents pulling real-time data. But hype-check: the 3x speedup is workload-specific, and short prompts see less of it. Competitors like the Grok API or Replicate can match these speeds centrally, and more cheaply for bursty traffic.

Bottom line: Cloudflare is laying the groundwork for XLMs in agents without infrastructure headaches. Developers gain fast, global inference; enterprises dodge AWS/GCP bills. Watch adoption: if Workers AI captures 10% of the $50B inference market by 2026, it disrupts. Until then, test the latencies yourself; numbers don’t lie.

April 16, 2026 · 3 min · Source: Cloudflare Blog