
llm-all-models-async 0.1


Simon Willison just shipped llm-all-models-async version 0.1. This plugin automatically registers async wrappers for any sync-only models in the LLM ecosystem. It solves a concrete pain point: Datasette’s LLM features, like datasette-enrichments-llm, demand async models, but plugins like Willison’s own llm-mrchatterbox run purely sync.

Why does this distinction matter? LLM plugins—over 50 of them as of late 2024—split into sync and async flavors. Async dominates for API calls to OpenAI, Anthropic, or Grok, where non-blocking I/O keeps web servers responsive. Sync plugins handle local inference: think Ollama, llama.cpp, or custom runners like mrchatterbox, which execute models directly on your hardware without network hops. Datasette, Willison’s SQLite-powered data app framework, enforces async for its plugin hooks to avoid blocking Uvicorn’s event loop.
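The event-loop constraint is easy to see in miniature: one blocking sync call freezes every other coroutine until it returns, while handing the same call to a thread keeps the loop live. A minimal sketch, using time.sleep as a stand-in for sync model inference (sync_inference and heartbeat are illustrative names, not anything from LLM or Datasette):

```python
import asyncio
import time

def sync_inference():
    # Stand-in for a sync-only local model call
    time.sleep(0.2)
    return "result"

async def heartbeat(ticks):
    # Stand-in for a web server handling other requests
    for i in range(10):
        await asyncio.sleep(0.02)
        ticks.append(i)

async def main():
    ticks = []
    hb = asyncio.create_task(heartbeat(ticks))
    sync_inference()               # blocking: heartbeat never gets scheduled
    stalled = len(ticks)           # 0 ticks landed during the blocking call
    await asyncio.to_thread(sync_inference)  # off-loaded: loop keeps running
    await hb
    return stalled, len(ticks)

stalled, total = asyncio.run(main())
print(stalled, total)  # → 0 10
```

asyncio.to_thread (Python 3.9+) is the stdlib shorthand for the same thread-pool hand-off pattern the plugin relies on.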

The Problem in Practice

Willison hit this wall testing llm-mrchatterbox, a sync plugin for a custom local model. He couldn’t plug it into datasette-enrichments-llm, which uses LLMs to generate column suggestions, enrich rows, or summarize datasets on the fly. Datasette powers a large number of public instances, with over 1,000 indexed in its registry, many of them running real-time data pipelines. Forcing sync models into that environment stalls everything.

This isn’t niche. Local models cut costs (no $0.01–$0.10 per 1K tokens in API fees) and boost privacy, running inference on your CPU/GPU without sending data to cloud providers. But Python’s async ecosystem, built on asyncio, has no tolerance for sync blockers: a single blocking call stalls the entire event loop. Rewriting 10–20 sync plugins for async? Tedious and error-prone.

How the Fix Works

Enter llm-all-models-async. Willison prompted Claude (Anthropic’s model) to generate it, then iterated. It scans LLM’s registered models at plugin load, wraps sync ones in a thread pool executor from concurrent.futures, and registers async equivalents with names like mrchatterbox-async.
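The post doesn’t reproduce Willison’s implementation, but the described behavior — walk the registry, skip models that already support async, append a thread-pool wrapper under a -async name — can be sketched as follows. AsyncWrapper, register_async_wrappers, FakeModel, and the supports_async flag are all illustrative names, not LLM’s real API:

```python
import asyncio
import functools

class AsyncWrapper:
    """Async facade over a sync-only model (illustrative, not LLM's real classes)."""

    def __init__(self, sync_model):
        self.sync_model = sync_model
        self.model_id = f"{sync_model.model_id}-async"

    async def prompt(self, *args, **kwargs):
        # Bundle kwargs with partial: run_in_executor forwards positional args only
        call = functools.partial(self.sync_model.prompt, *args, **kwargs)
        # None = the event loop's default ThreadPoolExecutor
        return await asyncio.get_running_loop().run_in_executor(None, call)

def register_async_wrappers(registry):
    """Append an async wrapper for every registered model lacking async support."""
    existing = {m.model_id for m in registry}
    for model in list(registry):
        wrapped_id = f"{model.model_id}-async"
        if not getattr(model, "supports_async", False) and wrapped_id not in existing:
            registry.append(AsyncWrapper(model))

# Demo with stand-in models
class FakeModel:
    def __init__(self, model_id, supports_async=False):
        self.model_id = model_id
        self.supports_async = supports_async

    def prompt(self, text):
        return f"{self.model_id}: {text}"

registry = [FakeModel("mrchatterbox"), FakeModel("gpt-4o-mini", supports_async=True)]
register_async_wrappers(registry)
print([m.model_id for m in registry])
# → ['mrchatterbox', 'gpt-4o-mini', 'mrchatterbox-async']
print(asyncio.run(registry[-1].prompt("hello")))
# → mrchatterbox: hello
```

The model that already supports async is left alone; only the sync-only one gains a -async twin, matching the mrchatterbox-async naming the plugin uses.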

The core mechanism requires a new LLM hook, which Willison added in LLM 0.30 (released hours earlier). Here’s the essence:

import asyncio
import functools
from concurrent.futures import ThreadPoolExecutor
from llm import PluginModel

# Shared pool: spinning up a fresh executor per call would add thread startup cost
executor = ThreadPoolExecutor(max_workers=4)

async def async_wrapper(sync_model: PluginModel, *args, **kwargs):
    loop = asyncio.get_running_loop()
    # run_in_executor forwards only positional args, so bundle kwargs with partial
    call = functools.partial(sync_model.invoke, *args, **kwargs)
    return await loop.run_in_executor(executor, call)

LLM 0.30 exposes llm.hooks.register_async_model_wrapper(), letting plugins like this one inject wrappers systematically. Install via pip install llm-all-models-async, and sync models appear as async options in llm models list.

Test it: Fire up Datasette with datasette-enrichments-llm, point it at mrchatterbox-async, and enrich a CSV of financial trades. No API keys, no latency spikes from remote calls.

Implications and Trade-offs

This bridges Python’s sync/async chasm without forking the ecosystem. Developers gain drop-in local inference for web apps—crucial for Datasette users analyzing crypto trades, security logs, or IoT streams offline. Expect uptake: LLM’s plugin count grows weekly, and Datasette powers indie data projects from climate data to blockchain explorers.

Skeptical lens: Thread pools aren’t free. Python’s GIL caps multi-threaded CPU-bound inference (e.g., llama.cpp at 10–50 tokens/sec on consumer hardware). Overhead adds 5–20ms per call from context switching. For I/O-light local models, it’s fine; for heavy matrix math, stick to native async like llama-cpp-python’s async mode.
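The overhead figure is worth checking on your own hardware. This rough micro-benchmark (tiny_call and dispatch_overhead are hypothetical helpers) times a trivial function called directly versus dispatched through run_in_executor; the 5–20ms estimate above is machine-dependent, not a guarantee:

```python
import asyncio
import time

def tiny_call():
    # Near-zero work, so measured time is mostly dispatch overhead
    return sum(range(100))

async def dispatch_overhead(n=200):
    loop = asyncio.get_running_loop()

    start = time.perf_counter()
    for _ in range(n):
        tiny_call()
    direct = time.perf_counter() - start

    start = time.perf_counter()
    for _ in range(n):
        await loop.run_in_executor(None, tiny_call)
    threaded = time.perf_counter() - start

    # Extra milliseconds per call added by the thread-pool hand-off
    return (threaded - direct) / n * 1000

overhead_ms = asyncio.run(dispatch_overhead())
print(f"~{overhead_ms:.3f} ms of thread-pool overhead per call")
```

For a local model that takes hundreds of milliseconds per token anyway, this per-call tax disappears into the noise; for rapid-fire tiny calls, it compounds.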

AI-generated code? Claude nailed the boilerplate, but Willison audited and shipped it—smart use of tools, not blind trust. Security note: Wrappers inherit sync model permissions; vet plugins before async exposure in production Datasette.

Bottom line: This unblocks local LLMs in async stacks, slashing reliance on pricey APIs. For data-heavy ops—finance dashboards, threat intel feeds—it means faster, cheaper smarts without vendor lock-in. Track LLM’s GitHub: 0.30 already pulls 100+ stars in hours.

April 1, 2026 · 3 min · 7 views · Source: Simon Willison
