Google DeepMind's new open-weight text diffusion model generates more than 1,000 tokens per second on a single H100, roughly four times the throughput of comparable autoregressive Gemma models in local, single-user workloads.

Google DeepMind has released DiffusionGemma, an experimental open-weight model that generates text through a diffusion process rather than the token-by-token approach used by most large language models (LLMs). The weights are available immediately on Hugging Face under an Apache 2.0 license. [1,2]

Unlike autoregressive models, which produce text left to right one token at a time, DiffusionGemma starts with a block of 256 random placeholder tokens and refines them across multiple passes until coherent text emerges — a technique borrowed from image generation AI. [1,2] The model finalizes all outputs in one large block, a process Google describes as “denoising” a text canvas. [1]
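As a rough illustration of the refinement loop described above, the following toy sketch starts from a block of placeholder tokens and commits a few positions per pass. It is not DiffusionGemma's algorithm: the `toy_denoise` function, the fixed target string, and the reveal schedule are hypothetical stand-ins for a learned model that would predict tokens and confidence scores itself.

```python
import random

MASK = "_"  # hypothetical placeholder token; the sources do not describe the real one

def toy_denoise(target, steps=4, seed=0):
    """Toy stand-in for iterative block refinement.

    A real diffusion model would predict every position in parallel and keep
    its most confident guesses. Here we simply reveal a random subset of a
    known target on each pass to show the shape of the loop.
    """
    rng = random.Random(seed)
    block = [MASK] * len(target)          # start from an all-placeholder block
    history = ["".join(block)]
    hidden = list(range(len(target)))     # positions not yet committed
    rng.shuffle(hidden)
    per_step = -(-len(target) // steps)   # ceiling division
    for _ in range(steps):
        for pos in hidden[:per_step]:
            block[pos] = target[pos]      # commit this position
        hidden = hidden[per_step:]
        history.append("".join(block))
    return history

passes = toy_denoise("text from noise", steps=4)
for i, state in enumerate(passes):
    print(i, state)
```

Each printed line shows the whole block after one pass, with fewer placeholders remaining each time until the full string is in place.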

DiffusionGemma uses a Mixture of Experts (MoE) architecture — a design in which several specialized sub-networks exist side by side and only the relevant ones activate for a given input — with 26 billion total parameters but only 3.8 billion activated during inference. [1,2] When quantized to lower numerical precision, the model fits within 18 GB of video RAM (VRAM), making it compatible with high-end consumer GPUs. [1,2]
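The parameter counts above allow a back-of-envelope check on the memory figure. The sketch below assumes that weights dominate the memory footprint and ignores activations, caches, and runtime overhead; the bit widths are illustrative choices, not values reported by Google or Nvidia.

```python
TOTAL_PARAMS = 26e9    # total parameters, from the article
ACTIVE_PARAMS = 3.8e9  # parameters activated during inference, from the article

def weight_gb(params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9

# Share of the model that is active for a given input.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

for bits in (16, 8, 4):  # illustrative precisions, assumed for this sketch
    print(f"{bits}-bit weights: ~{weight_gb(TOTAL_PARAMS, bits):.0f} GB")
print(f"active share of parameters: {active_fraction:.1%}")
```

Under these assumptions, 16-bit weights alone would need about 52 GB and 4-bit weights about 13 GB, which is consistent with a quantized model fitting in 18 GB once overhead is included. All 26 billion parameters still have to be stored even though only about 15 percent are active for a given input.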

In benchmarks, DiffusionGemma reaches approximately 700 tokens per second on an Nvidia RTX 5090 and more than 1,000 tokens per second on a single Nvidia H100 AI accelerator. [1,2] Nvidia also reports around 150 tokens per second on the DGX Spark deskside system and up to 800 tokens per second on the DGX Station. [2] Those figures represent roughly four times the throughput of similarly sized autoregressive Gemma models in single-user local inference. [1,2]
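To put those rates in terms of waiting time, the arithmetic below converts tokens per second into the time needed for one 256-token block. The autoregressive baseline is not a measured number: it is derived by dividing each reported figure by the roughly fourfold speedup quoted above, and the H100 figure is treated as a lower bound of 1,000.

```python
BLOCK_TOKENS = 256  # block size cited in the article
SPEEDUP = 4.0       # "roughly four times", treated as exact for illustration

# Approximate figures from the article, in tokens per second.
reported_tps = {"RTX 5090": 700.0, "H100": 1000.0}

def seconds_for(tokens, tokens_per_second):
    """Time in seconds to generate `tokens` at a steady rate."""
    return tokens / tokens_per_second

for gpu, tps in reported_tps.items():
    baseline_tps = tps / SPEEDUP  # implied baseline, not a benchmark result
    print(f"{gpu}: {seconds_for(BLOCK_TOKENS, tps):.2f} s vs "
          f"{seconds_for(BLOCK_TOKENS, baseline_tps):.2f} s implied autoregressive")
```

On these numbers a full block takes about 0.37 seconds on the RTX 5090 against an implied 1.46 seconds, and about 0.26 seconds on the H100 against an implied 1.02 seconds.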

The speed advantage stems from a shift in hardware bottleneck. With autoregressive models, single-user inference is typically constrained by memory bandwidth, leaving GPU compute units largely idle while waiting for data. [2] By processing up to 256 tokens in parallel, DiffusionGemma moves the bottleneck toward raw compute, keeping GPUs more consistently occupied. [1,2]
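The bottleneck shift can be sketched with a simple roofline-style estimate. The model below assumes that each forward pass is limited by reading weights from memory and that a pass produces either one token (autoregressive) or a whole block (parallel). The bandwidth, gigabytes read per pass, and number of denoising passes are hypothetical inputs chosen for illustration, not measurements from Google or Nvidia.

```python
def memory_bound_tps(bandwidth_gb_s, weight_gb_per_pass, tokens_per_pass,
                     passes_per_output=1):
    """Upper bound on tokens/s when each pass is limited by reading weights.

    bandwidth_gb_s:     memory bandwidth in GB/s (hypothetical input)
    weight_gb_per_pass: weight data read per forward pass, in GB
    tokens_per_pass:    tokens produced together by one pass
    passes_per_output:  forward passes needed before those tokens are final
    """
    passes_per_second = bandwidth_gb_s / weight_gb_per_pass
    return passes_per_second * tokens_per_pass / passes_per_output

# Hypothetical: 1000 GB/s bandwidth, 2 GB of active weights read per token.
autoregressive = memory_bound_tps(1000, 2.0, tokens_per_pass=1)

# Hypothetical: a 256-token block refined over 16 passes, with 13 GB read per
# pass on the assumption that a large block touches most experts.
block_parallel = memory_bound_tps(1000, 13.0, tokens_per_pass=256,
                                  passes_per_output=16)

print(f"autoregressive ceiling: {autoregressive:.0f} tokens/s")
print(f"block-parallel ceiling: {block_parallel:.0f} tokens/s")
```

With these made-up inputs the memory-bandwidth ceiling rises from 500 to roughly 1,230 tokens per second. Once that ceiling is high enough, arithmetic throughput rather than bandwidth is more likely to become the limit, which is the shift described above.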

Google and Nvidia note that the advantage is specific to local, single-user scenarios. In cloud deployments serving many parallel requests, autoregressive models already keep hardware busy, and DiffusionGemma can actually increase costs in that setting, according to Google. [2] Google also notes that on shared-memory systems such as Apple Silicon, which face their own memory-bandwidth constraints, the speed gap over autoregressive models is likely smaller. [2]

The parallel generation approach also opens up task categories where autoregressive models struggle. Because every token can reference every other token in the block, including tokens that appear later in the sequence, DiffusionGemma is better suited to tasks such as inline text editing, filling gaps in code, molecular amino acid sequencing, and mathematical graphing. [1,2] Google highlights Sudoku solving as a demonstration: each cell's value depends on cells that come later in the sequence, a dependency structure that trips up standard autoregressive models but is handled more naturally by DiffusionGemma’s continuous self-correction over the full token block. [1,2]
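The difference in which positions can see which others can be illustrated with visibility masks. This sketch builds a causal mask, in which position i sees only positions up to i, and a full mask, in which every position sees the whole block. The function names are hypothetical, and the sources do not describe DiffusionGemma's actual attention implementation.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def full_mask(n):
    """Every position may attend to every other position in the block."""
    return [[True] * n for _ in range(n)]

def visible_future(mask):
    """Count visible (i, j) pairs with j > i, i.e. look-ahead links."""
    return sum(1 for i, row in enumerate(mask)
               for j, ok in enumerate(row) if ok and j > i)

n = 4
print(visible_future(causal_mask(n)))  # 0: no position sees a later one
print(visible_future(full_mask(n)))    # 6: each position sees all later ones
```

In a Sudoku-style task the look-ahead links are the ones that matter: a cell early in the sequence can only be checked against later cells if those links exist.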

The tradeoff is output quality. Diffusion-based text generation carries a higher error rate than autoregressive generation; a single badly predicted token can render an entire block meaningless and require regeneration, whereas a single bad pixel in an image diffusion model rarely ruins the output. [1] Diffusion models also consume more resources when the desired output is short, since the parallel denoising process does significant work even to produce just a few tokens. [1] Google explicitly recommends its standard Gemma 4 models when quality is the priority, positioning DiffusionGemma as a tool for researchers and developers experimenting with fast local workflows. [2]

Google worked with Nvidia on optimization. Nvidia quantized the model for the RTX 5090 and RTX 4090 and optimized it for Hopper and Blackwell server architectures. [2] The model is also available through the Gemini Enterprise Agent Platform Model Garden and Nvidia NIM. [2]

DiffusionGemma works out of the box with Hugging Face Transformers, vLLM (with Red Hat integration support), and MLX. [2] For fine-tuning, Google points to its own JAX-based Hackable Diffusion toolkit, along with Unsloth and the Nvidia NeMo Framework. [2] Support for llama.cpp is planned. [2]

The release builds on earlier internal work: Google DeepMind had previously demonstrated Gemini Diffusion, an experimental text diffusion model for which it cited speeds of 1,479 tokens per second and performance roughly on par with Gemini 2.0 Flash-Lite in benchmarks. [2] DiffusionGemma is the first member of the open Gemma 4 family to use the diffusion approach. [1] The startup Inception has been pursuing a similar direction; its Mercury 2 model shipped in early 2026 as what the company describes as the first diffusion-based reasoning model. [2]


Sources

  1. Ars Technica — Google's latest DiffusionGemma open AI model comes with a 4x speed boost
  2. The Decoder — Google's new open model DiffusionGemma generates text from noise instead of word by word

This article was drafted with AI from the cited sources and checked against them before publication.