Orchestrating AI Code Review at Scale

Cloudflare’s engineering teams were losing hours waiting for code reviews on merge requests, and the median time to a first review had stretched out across projects. They addressed it with a CI-integrated AI system built around the open-source OpenCode agent. The setup runs up to seven specialized AI reviewers on every merge request, covering areas such as security, performance, code quality, documentation, release management, and compliance with the internal Codex. A coordinator agent merges their output, removes duplicates, assesses severity, and posts one structured comment. Across tens of thousands of merge requests, it has reportedly auto-approved clean code, flagged real bugs, and blocked merges on serious vulnerabilities.

This matters because code review gridlock kills velocity in large organizations like Cloudflare, which manages thousands of repositories. Traditional reviews force context switches, nitpicks over naming, and back-and-forth cycles. Off-the-shelf AI tools could not be customized enough for that scale. Naively prompting an LLM with a git diff produced noise: hallucinated errors and vague advice such as “add error handling” on functions that were already robust. Cloudflare’s approach orchestrates narrow specialists instead of one bloated model, which keeps each prompt focused and cuts down on that noise.

Architecture: Modular Plugins Over Monoliths

Rather than hardcoding behavior for thousands of repositories, the team built the system around plugins. It triggers in CI/CD pipelines, pulls the diff, and feeds it to OpenCode-based agents. Each specialist handles a narrow domain: security scans for vulnerabilities such as injection flaws or leaked secrets; performance looks for CPU and memory hogs; code quality enforces conventions without nitpicking trivia. The coordinator then applies judgment logic to prioritize the findings, and high-severity issues block the merge outright.
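The fan-out and coordination described above can be sketched in a few lines. This is a minimal illustration, not Cloudflare's implementation: the `Finding` fields, the three-level severity scale, the stub reviewers, and the rule of deduplicating on file and line are all assumptions made for the example.

```python
import asyncio
from dataclasses import dataclass

# Hypothetical severity scale; the article does not name the real levels.
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}


@dataclass(frozen=True)
class Finding:
    reviewer: str
    path: str
    line: int
    severity: str
    message: str


async def security_reviewer(diff: str) -> list[Finding]:
    # Stand-in for a real agent call; returns a canned finding.
    return [Finding("security", "auth.py", 12, "critical",
                    "SQL query built by string concatenation")]


async def quality_reviewer(diff: str) -> list[Finding]:
    # Stand-in for a real agent call; returns canned findings.
    return [
        Finding("quality", "auth.py", 12, "warning", "query is hard to read"),
        Finding("quality", "util.py", 3, "info", "unused import"),
    ]


async def run_reviewers(diff: str, reviewers) -> list[Finding]:
    # Fan out to every specialist concurrently, then flatten the results.
    batches = await asyncio.gather(*(review(diff) for review in reviewers))
    return [finding for batch in batches for finding in batch]


def coordinate(findings: list[Finding]) -> tuple[str, list[Finding]]:
    # Deduplicate on (path, line), keeping the most severe report.
    # Simplification: two distinct issues on one line collapse into one.
    best: dict[tuple[str, int], Finding] = {}
    for f in findings:
        key = (f.path, f.line)
        kept = best.get(key)
        if kept is None or SEVERITY_RANK[f.severity] > SEVERITY_RANK[kept.severity]:
            best[key] = f
    merged = sorted(
        best.values(),
        key=lambda f: (-SEVERITY_RANK[f.severity], f.path, f.line),
    )
    if any(f.severity == "critical" for f in merged):
        verdict = "block"
    elif merged:
        verdict = "comment"
    else:
        verdict = "approve"
    return verdict, merged


if __name__ == "__main__":
    found = asyncio.run(run_reviewers("", [security_reviewer, quality_reviewer]))
    verdict, merged = coordinate(found)
    print(verdict, [f.message for f in merged])
```

Keeping the verdict logic in plain code rather than in a prompt makes the blocking rule easy to audit. A real coordinator would likely need a fuzzier notion of "duplicate" than an exact file and line match.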

The system integrates with GitHub or Cloudflare’s internal GitLab and posts each review as a bot comment that carries structured JSON for easy parsing. Engineers see actionable findings with severity scores. On clean runs, it approves the change without waiting for a human reviewer. Data from more than 10,000 MRs reportedly shows it catching issues humans miss, such as subtle race conditions, while avoiding the false positives that annoy developers.
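One way a bot comment can be both readable and machine-parseable is to pair a short summary with a JSON payload. The sketch below assumes a hypothetical layout in which the payload sits inside an HTML comment at the end of the body; the marker name, the field names, and the layout are illustrative, not the format Cloudflare uses.

```python
import json

# Hypothetical markers that delimit the machine-readable payload.
MARKER_OPEN = "<!-- review-data"
MARKER_CLOSE = "-->"


def render_comment(verdict: str, findings: list[dict]) -> str:
    # Human-readable summary first, one line per finding.
    lines = [f"AI review verdict: {verdict}", ""]
    for f in findings:
        lines.append(f"- [{f['severity']}] {f['path']}:{f['line']} {f['message']}")
    if not findings:
        lines.append("No issues found.")
    # json.dumps escapes newlines, so the payload always fits on one line.
    payload = json.dumps({"verdict": verdict, "findings": findings}, sort_keys=True)
    lines += ["", MARKER_OPEN, payload, MARKER_CLOSE]
    return "\n".join(lines)


def parse_comment(body: str) -> dict:
    # The payload is always the second-to-last line of the rendered body.
    lines = body.splitlines()
    if len(lines) < 3 or lines[-3] != MARKER_OPEN or lines[-1] != MARKER_CLOSE:
        raise ValueError("comment does not carry a review payload")
    return json.loads(lines[-2])


if __name__ == "__main__":
    body = render_comment("block", [
        {"severity": "critical", "path": "auth.py", "line": 12,
         "message": "SQL query built by string concatenation"},
    ])
    print(body)
    print(parse_comment(body)["verdict"])
```

Downstream tooling can then read the verdict and findings back with `parse_comment` instead of scraping the prose.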

Challenges: LLMs in the Critical Path

Putting AI in the CI/CD critical path exposes harsh realities. Token limits cap context, so Cloudflare chunks large diffs and summarizes them. Hallucinations persist, so the team tuned prompts with codebase-specific examples and adjusted rejection thresholds. Latency is a cost too: seven agents plus coordination add seconds to minutes per MR, though parallel execution keeps the median under 2 minutes. Spend scales with usage, so at Cloudflare’s volume they match model size to the task: smaller models for nits, larger ones for security.
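The article does not say how large diffs are chunked. A minimal sketch of one plausible approach is to split a unified diff at file boundaries and pack whole files into chunks under a token budget. The four-characters-per-token estimate is a rough assumption, and the function names are hypothetical; a real system would count tokens with the model's own tokenizer.

```python
def split_by_file(diff: str) -> list[str]:
    # Start a new section at each "diff --git" header of a unified diff.
    sections: list[str] = []
    current: list[str] = []
    for line in diff.splitlines(keepends=True):
        if line.startswith("diff --git ") and current:
            sections.append("".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("".join(current))
    return sections


def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about four characters per token.
    return max(1, len(text) // 4)


def pack_chunks(sections: list[str], budget: int) -> list[str]:
    # Greedily pack whole file sections into chunks that fit the budget.
    # A single section larger than the budget becomes its own oversized
    # chunk; splitting it further (for example per hunk) is left out here.
    chunks: list[str] = []
    current: list[str] = []
    used = 0
    for section in sections:
        cost = estimate_tokens(section)
        if current and used + cost > budget:
            chunks.append("".join(current))
            current, used = [], 0
        current.append(section)
        used += cost
    if current:
        chunks.append("".join(current))
    return chunks


if __name__ == "__main__":
    demo = "".join(
        f"diff --git a/f{i}.py b/f{i}.py\n+{'x' * 99}\n" for i in range(3)
    )
    for chunk in pack_chunks(split_by_file(demo), budget=70):
        print(estimate_tokens(chunk), "estimated tokens")
```

Packing whole files keeps each reviewer's context coherent, at the cost of the occasional chunk that still exceeds the budget.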

Security risks loom large. An AI that reviews code could leak sensitive diffs to external APIs, so everything runs air-gapped on internal infrastructure with approved models. Compliance demands traceability: every review logs its inputs and outputs for audits. The team also iterated through false blocks. Early versions halted merges over stylistic gripes, and the severity models were refined through human feedback loops.
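The traceability requirement can be illustrated with an append-only JSON Lines log that records what each reviewer saw and said. The record fields below (`mr_id`, `reviewer`, `model`, and so on) are hypothetical, and the checksum is an added convenience for spotting tampering, not something the article describes.

```python
import hashlib
import io
import json
from datetime import datetime, timezone


def audit_record(mr_id: str, reviewer: str, model: str,
                 diff: str, output: str) -> dict:
    # One record per reviewer run: the input, the output, and a checksum
    # of the input so later edits to the log entry are detectable.
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "mr_id": mr_id,
        "reviewer": reviewer,
        "model": model,
        "input": diff,
        "input_sha256": hashlib.sha256(diff.encode("utf-8")).hexdigest(),
        "output": output,
    }


def append_audit(stream, record: dict) -> None:
    # JSON Lines: one self-contained JSON object per line.
    stream.write(json.dumps(record, sort_keys=True) + "\n")


def verify(record: dict) -> bool:
    # Recompute the checksum to confirm the logged input is intact.
    digest = hashlib.sha256(record["input"].encode("utf-8")).hexdigest()
    return digest == record["input_sha256"]


if __name__ == "__main__":
    log = io.StringIO()  # stands in for a file or log sink
    append_audit(log, audit_record("mr-1", "security", "example-model",
                                   "+print('hi')\n", "no findings"))
    stored = json.loads(log.getvalue().splitlines()[0])
    print(verify(stored))
```

Where diffs are too sensitive to store verbatim, the same record could keep only the hash and a pointer to the commit.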

A skeptical note: the claims of “impressive accuracy” come without public metrics. No precision or recall numbers are shared, and “blocks serious problems” cannot be verified from outside. Still, sustained internal deployment at scale suggests it works better than the alternatives. Humans can still override the AI’s calls, which keeps a person in the loop.

The implications extend beyond Cloudflare. In security-heavy fields like crypto or finance, automated vulnerability detection before merge reduces exposure. Developers ship faster, with review waits dropping from hours to minutes, without a drop in quality. OpenCode’s open-source nature invites replication, but expect tuning pains. Cloudflare ties this work to “Code Orange: Fail Small,” its resiliency push, and presents AI as augmenting engineers rather than replacing them. For organizations bottlenecked by reviews, this blueprint cuts drag while hardening code.

Bottom line: viable AI code review demands orchestration, not hacks. Cloudflare shows that it can scale, but only with investment in infrastructure, tuning, and safeguards. Others chasing velocity should study the plugins-first design.
