Welcome Gemma 4: Frontier multimodal intelligence on device

Google DeepMind just dropped Gemma 4, a family of open multimodal AI models now live on Hugging Face under Apache 2.0. These accept image, text, and audio inputs and generate text output. Crucially, they run on-device, from laptops to phones, cutting reliance on cloud servers. That matters because it hands developers and users powerful AI without handing data to remote providers, slashing privacy risks and latency.

Four variants exist: Gemma 4 E2B (2.3 billion effective parameters, 5.1B total with embeddings), E4B (4.5B effective, 8B total), a dense 31B model, and a 26B mixture-of-experts (MoE) with only 4B parameters active per token. Context windows reach 128K tokens for the two small models and 256K for the two large ones. All come in base and instruction-tuned flavors. The architecture sticks to proven components: per-layer embeddings for efficiency, a shared KV cache for long contexts, and an image encoder upgraded for variable aspect ratios and tunable token counts. Audio input is limited to the E2B and E4B models.
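For a sense of scale, here's a back-of-envelope estimate of weight storage at 4-bit quantization, using the parameter counts above (the 4 bits/parameter figure is an approximation that ignores quantization scales and other overhead):

```python
# Approximate weight memory for each Gemma 4 variant at 4-bit quantization.
# Parameter counts are the totals quoted above; real quantized files carry
# some overhead (scales, embeddings often kept at higher precision).

VARIANTS = {
    "E2B (5.1B total)": 5.1e9,
    "E4B (8B total)": 8.0e9,
    "31B dense": 31e9,
    "26B MoE (all experts resident)": 26e9,
}

def quantized_gib(params: float, bits: int = 4) -> float:
    """Weight storage in GiB at the given bit width."""
    return params * bits / 8 / 2**30

for name, params in VARIANTS.items():
    print(f"{name}: ~{quantized_gib(params):.1f} GiB")
```

At 4-bit, E2B's weights come out around 2.4 GiB, which fits the 4-8GB RAM budgets quoted for small models once activations and the KV cache are added. Note the MoE still needs all 26B parameters resident even though only 4B are active per token.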

Why These Specs Shift the Game

On-device multimodal AI isn’t hype—it’s practical now. Smaller models like E2B fit in 4-8GB of RAM after quantization, enabling real-time video analysis or voice transcription on mid-range hardware. The 31B dense and 26B MoE models scale to desktops or edge servers without melting GPUs. Against similarly sized rivals, they match or beat Llama 3.1 8B on vision-language arena tasks while staying fully open. Pre-release tests showed strong out-of-the-box performance, reducing the need for fine-tuning.
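Long contexts stress memory through the KV cache, not just the weights. Here is a minimal sketch of the standard sizing formula; the layer count, KV-head count, and head dimension are illustrative assumptions, since Gemma 4's exact configuration isn't given here:

```python
# Rough KV-cache size at full context length. The config numbers used in
# the example call are ILLUSTRATIVE assumptions, not Gemma 4's real ones;
# the formula itself (K and V tensors per layer) is standard.

def kv_cache_gib(ctx_len: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per: int = 2) -> float:
    """fp16 K and V cache size in GiB for a given context length."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per / 2**30

# Hypothetical small-model config at the full 128K window:
print(f"~{kv_cache_gib(128_000, n_layers=30, n_kv_heads=4, head_dim=128):.1f} GiB")
```

At these assumed dimensions, a full 128K-token fp16 cache alone runs around 7 GiB, which is exactly why cache-sharing tricks like the shared KV cache mentioned above matter on constrained hardware.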

Implications run deep. Privacy-focused apps thrive because no data leaves your device. Security pros get local threat detection via image and audio analysis. In crypto, on-chain agents could process market charts or audio news feeds offline. Some skepticism is warranted, though: Google open-sources to crowdsource improvements, yet training-data opacity lingers. Arena scores claim Pareto-frontier status, but real-world evals vary by quantization and hardware.

Run It Yourself: Deployment Facts

Compatibility spans ecosystems. Hugging Face Transformers loads them instantly:

from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("google/gemma-4-e2b-it", torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-e2b-it")

llama.cpp quantizes for CPU-only runs. Convert to GGUF, then quantize:

$ python convert_hf_to_gguf.py /path/to/gemma-4-e2b --outfile gemma-4-e2b-f16.gguf --outtype f16
$ ./llama-quantize gemma-4-e2b-f16.gguf gemma-4-e2b-q4_0.gguf q4_0
$ ./llama-cli -m gemma-4-e2b-q4_0.gguf -p "Describe this image: <image_token>"
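The <image_token> placeholder above stands in for wherever the runtime splices in image embeddings. A trivial, hypothetical helper for assembling such prompts (the exact placeholder text a given build expects may differ):

```python
# Sketch of multimodal prompt assembly for a llama.cpp-style run.
# "<image_token>" mirrors the command above; check your build's docs
# for the placeholder string it actually recognizes.

def build_prompt(text: str, n_images: int = 1,
                 placeholder: str = "<image_token>") -> str:
    """Prefix the text with one placeholder per attached image."""
    return " ".join([placeholder] * n_images + [text])

print(build_prompt("Describe this image:"))
```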

MLX covers Apple silicon, transformers.js runs in browsers via WebGPU, and there are even Rust bindings via mistral.rs. Fine-tuning? TRL or Unsloth on a single RTX 4090 handles the E4B in hours. Vertex AI scales it cloud-side if needed.

Benchmarks back the claims: E4B hits 75% on MMMU (multimodal understanding), edging Phi-3.5-vision, while 31B crushes 82% on GPQA science QA. Quantized E2B retains 90% quality at 4-bit. Drawbacks? Audio lags behind Whisper on accents; MoE inference needs optimized runtimes.

Bottom line: Gemma 4 accelerates on-device AI adoption. Developers bypass API costs and censorship. Users reclaim control. Test the E2B first—it’s the entry point to frontier capabilities without frontier hardware. If it delivers consistently, expect a wave of local agents disrupting cloud giants.

April 2, 2026 · 3 min · Source: Hugging Face
