DiffusionGemma 26B: 1,100 tok/s Open Diffusion LLM (2026)

DiffusionGemma 26B-A4B by Google DeepMind: open-source text diffusion model, 1,100 tok/s on H100, 256K context, 73.2% GPQA Diamond. Free under Apache 2.0.

DiffusionGemma 26B-A4B is Google DeepMind's experimental open-source text diffusion model, released June 10, 2026, reaching 1,100 tokens per second on a single H100 in FP8 with 73.2% GPQA Diamond and a 256,000-token context window. Built on the Gemma 4 26B-A4B MoE backbone with 25.2 billion total parameters (3.8 billion active), it is free to self-host under Apache 2.0 via Hugging Face, NVIDIA NIM, and Google Cloud Vertex AI.

With 1,100 tokens per second on a single H100 in FP8, DiffusionGemma 26B-A4B is Google DeepMind's open-source discrete diffusion model released June 10, 2026. It scores 73.2% on GPQA Diamond and 77.6% on MMLU Pro, running on a 25.2B-parameter MoE backbone with a 256K-token context window. Weights are free under Apache 2.0 on Hugging Face.

Provider: Google · Family: DiffusionGemma

Context window: 256,000 tokens

Input modalities: text, image, video, tool-calls · Output: text, tool-calls

About DiffusionGemma 26B-A4B

DiffusionGemma 26B-A4B is Google DeepMind's first discrete text diffusion language model, released publicly on June 10, 2026 under an Apache 2.0 license. Built on the same Mixture-of-Experts backbone as Gemma 4 26B-A4B (128 fine-grained experts, top-8 routing, 25.2 billion total parameters with 3.8 billion active per token), the model replaces the standard one-token-at-a-time decoding loop with a parallel denoising approach that generates 256 tokens simultaneously per forward pass. Rather than predicting the next token, the diffusion process starts with a noisy canvas of 256 masked tokens and iteratively refines them over up to 48 denoising steps. The result is a model targeting speed-critical, low-concurrency local workloads: inline editing, rapid prototyping, and interactive assistants where raw throughput matters more than maximizing benchmark scores. On quality benchmarks, DiffusionGemma 26B-A4B delivers solid but not class-leading scores. It achieves 77.6% on MMLU Pro, 73.2% on GPQA Diamond, 69.1% on AIME 2026 (no tools), and 69.1% on LiveCodeBench v6, with a Codeforces ELO of 1,429. Against its autoregressive sibling Gemma 4 26B-A4B, the diffusion variant trails on every general benchmark: MMLU Pro 77.6% vs 82.6%, GPQA Diamond 73.2% vs 82.3%, AIME 2026 69.1% vs 88.3%. The one area where DiffusionGemma wins is document parsing: on OmniDocBench 1.5, its bidirectional attention gives it a structural edge over causal models at OCR and layout-aware extraction. HumanEval pass@1 no-tools scores 11.0%, a known weakness tied to the block-generation approach struggling with left-to-right reasoning tasks that require predicting exact output format character by character. DiffusionGemma 26B-A4B supports a 256,000-token context window. The model uses an encoder-decoder design: the encoder processes the full prompt with causal attention to populate a KV cache, while the decoder uses bidirectional attention over each 256-token generation canvas. Outputs are produced one 256-token block at a time, so very long outputs require sequential multi-block generation. Independent needle-in-haystack recall benchmarks above 100K tokens have not been published as of June 2026. For PDF parsing and document extraction tasks, the bidirectional decoder attention gives DiffusionGemma a measurable advantage over causal models on OmniDocBench 1.5, even at moderate document lengths. The model accepts text, images, and video as inputs and produces text output. Inputs can be mixed in a single prompt: images, video clips, and text can all be provided together for context-heavy reasoning tasks. Native function calling is fully supported, enabling structured tool use and agentic workflows. A configurable thinking mode allows step-by-step reasoning before the final answer. Audio input and output are not supported. Confirmed live capabilities include conversational AI, text summarization, code generation, image and document understanding (OCR, chart comprehension, PDF parsing, screen and UI parsing), video content analysis, and multilingual inference across 35 or more languages. DiffusionGemma 26B-A4B is released under the Apache 2.0 license and is free to self-host. Weights are available on Hugging Face (google/diffusiongemma-26B-A4B-it), Kaggle, and Google Cloud Vertex AI Model Garden with no download fees or royalties. NVIDIA provides a free hosted inference endpoint via NVIDIA NIM and NVIDIA Build (capacity-limited preview as of June 2026). For self-hosted deployments, the NVFP4-quantized build (nvidia/diffusiongemma-26B-A4B-it-NVFP4) runs within 18 GB VRAM, fitting on an RTX 4090 or L40S. FP8 requires approximately 28 GB (suitable for an H100 SXM). Full BF16 needs 50 GB or more, requiring a multi-GPU setup. On cloud rentals, an H100 at roughly $2.50 per hour in FP8 can serve this model comfortably for low-concurrency workloads, reaching 1,008 tokens per second. The model achieves 1,008 tokens per second on an H100 in FP8 and 1,288 tokens per second on an H200, versus roughly 250 to 300 tok/s for a comparably-sized autoregressive model on identical hardware. On an RTX 5090 with NVFP4, throughput exceeds 700 tok/s. On NVIDIA DGX Spark hardware, single-request generation with thinking disabled exceeded 100 tok/s. vLLM added first-class support for DiffusionGemma at launch (it is the first discrete diffusion LLM natively supported in vLLM), with same-day support also available in Hugging Face Transformers, MLX, and SGLang. The speed advantage applies specifically to low-concurrency inference; in high-QPS cloud serving, autoregressive models can batch many requests to saturate compute, potentially narrowing or eliminating the throughput gap. DiffusionGemma 26B-A4B was trained with Google's standard AI safety procedures: CSAM filtering, sensitive personal information removal, and content quality filtering aligned with Google's AI policies. The pre-training corpus spans web documents, code, mathematics, and images across 140 or more languages, with a data collection cutoff of January 2025. Post-training evaluations are documented in the official model card at ai.google.dev/gemma/docs/diffusiongemma/model_card. The model inherits Gemma family RLHF-based instruction following and safety alignment. No separate red-teaming disclosure or third-party safety audit has been published as of June 2026. The safety posture is comparable to other Gemma instruction-tuned models: balanced defaults with refusals for clear harms, without the heavy restrictions of API-only frontier models. DiffusionGemma 26B-A4B is the right choice for teams running speed-sensitive, low-concurrency, local inference workloads where 1,100 tok/s on consumer-grade hardware matters more than maximizing benchmark scores. Document extraction, OCR-heavy pipelines, and layout parsing benefit from the bidirectional attention. Researchers studying diffusion-based LLMs will find this the most capable open-source diffusion model available as of June 2026. Teams requiring the highest accuracy on AIME, GPQA, or coding benchmarks should prefer the autoregressive Gemma 4 26B-A4B (88.3% AIME vs 69.1%), which beats DiffusionGemma on every quality metric. Teams deploying at high concurrency on cloud GPUs may not see the speed advantage in production, since batching autoregressive inference closes the throughput gap at scale. DiffusionGemma 26B-A4B is the first model in Google's diffusion LLM line, released June 10, 2026 with day-zero support across vLLM, Hugging Face Transformers, MLX, and SGLang. On June 12, 2026, Unsloth Studio added support (v0.1.463-beta or 2026.6.6). NVIDIA simultaneously released a quantized NVFP4 variant (nvidia/diffusiongemma-26B-A4B-it-NVFP4) enabling 18 GB VRAM deployment. The architecture represents Google DeepMind's directional bet that diffusion-based decoding is worth trading 5-20 benchmark points for a 4x local throughput gain. A successor with an improved quality-speed trade-off or alternative block sizes is expected but unannounced as of June 2026.

Pricing

Open weights under Apache 2.0 with no per-token charges. Self-hosted compute cost: approximately $0.50/hr on an RTX 4090 (18 GB NVFP4, 700+ tok/s) to $2.50/hr on an H100 (28 GB FP8, 1,008 tok/s). NVIDIA NIM hosted inference is free in preview. Effective infra cost is roughly $0.70 per 1M tokens on an H100.

Key Features

Pros

Cons

Benchmarks

Frequently Asked Questions

What is DiffusionGemma 26B-A4B and who built it?

DiffusionGemma 26B-A4B is Google DeepMind's first discrete text diffusion language model, released publicly on June 10, 2026 under an Apache 2.0 license. It is built on the Gemma 4 26B-A4B Mixture-of-Experts backbone, using 25.2 billion total parameters and 3.8 billion active parameters per token, with 128 fine-grained experts and top-8 routing. Instead of generating one token at a time like standard autoregressive models, it denoises parallel 256-token canvases over up to 48 iterative steps, producing all 256 tokens simultaneously per forward pass. The model supports text, image, and video inputs with text output, native function calling, configurable thinking mode, and 35+ language multilingual inference. On benchmarks, it scores 77.6% on MMLU Pro, 73.2% on GPQA Diamond, and 69.1% on AIME 2026, trading 5-19 points vs the autoregressive Gemma 4 26B-A4B in exchange for 4x higher throughput at low concurrency. It sits in Google DeepMind's experimental Gemma family and is aimed at developers who prioritize speed and open-source freedom over maximum accuracy.

How much does DiffusionGemma 26B-A4B cost?

DiffusionGemma 26B-A4B is fully open-source under Apache 2.0, so there is no per-token charge from Google: you pay only for the compute you run it on. Weights are available free of charge on Hugging Face (google/diffusiongemma-26B-A4B-it), Kaggle, and Google Cloud Vertex AI Model Garden with no download fees or royalties. NVIDIA hosts the model at no cost via NVIDIA NIM and NVIDIA Build (capacity-limited preview as of June 2026), which is the fastest way to test the model without owning a GPU. For self-hosted inference, the NVFP4-quantized build fits within 18 GB VRAM; an RTX 4090 rented at roughly $0.50/hr serves 700+ tok/s for interactive workloads. An H100 at roughly $2.50 per hour in FP8 reaches 1,008 tok/s for production use, putting the effective cost at approximately $0.70 per 1 million tokens. The Apache 2.0 license permits commercial use with no attribution requirement and no usage caps, so total cost of ownership is purely infrastructure-driven.

What is DiffusionGemma 26B-A4B's context window and max output?

DiffusionGemma 26B-A4B supports a 256,000-token context window, matching its autoregressive sibling Gemma 4 26B-A4B in context length. The model uses an encoder-decoder design: the encoder processes the full prompt with causal attention to populate a KV cache, while the decoder uses bidirectional attention over each 256-token generation canvas. Outputs are produced one 256-token block at a time, meaning very long outputs require sequential multi-block generation. A separate max output token limit has not been officially published as of June 2026. Independent needle-in-haystack or long-context recall benchmarks above 100K tokens have not been released for DiffusionGemma, so performance in the upper half of the context window is unverified by third parties. For PDF parsing and document extraction at moderate lengths, the bidirectional decoder attention gives DiffusionGemma a measurable advantage over causal models, as confirmed by OmniDocBench 1.5 results. Gemma 4 26B-A4B has a more established track record for multi-hundred-thousand-token tasks due to published long-context recall evals.

How does DiffusionGemma 26B-A4B compare to Gemma 4 26B-A4B on benchmarks?

DiffusionGemma 26B-A4B trails the autoregressive Gemma 4 26B-A4B on every general knowledge and reasoning benchmark by a margin of 5 to 19 points. On MMLU Pro, DiffusionGemma scores 77.6% versus Gemma 4's 82.6%, a 5-point gap. On GPQA Diamond, the gap widens to 9 points (73.2% vs 82.3%). On AIME 2026, the largest gap appears: 69.1% vs 88.3%, a 19-point difference that makes DiffusionGemma a poor choice for competition math. LiveCodeBench v6 shows 69.1% for DiffusionGemma; Gemma 4 comparative data was not available at time of writing. The one benchmark where DiffusionGemma wins is OmniDocBench 1.5, where its bidirectional decoder attention outperforms causal attention on OCR and layout-aware document extraction. The tradeoff is explicit: DiffusionGemma sacrifices 5-19 benchmark points in exchange for roughly 4x throughput on the same hardware at low batch sizes. Teams where benchmark scores translate directly to output quality should stay with the autoregressive Gemma 4 variant.

Is DiffusionGemma 26B-A4B open source?

Yes, DiffusionGemma 26B-A4B is fully open-source under the Apache 2.0 license, one of the most permissive open-source licenses available. Weights are freely downloadable from Hugging Face (google/diffusiongemma-26B-A4B-it), Kaggle, and Google Cloud Vertex AI Model Garden with no royalties, no attribution requirements for commercial use, and no usage restrictions beyond the Apache 2.0 terms. The NVIDIA NVFP4 quantized variant (nvidia/diffusiongemma-26B-A4B-it-NVFP4) and Unsloth GGUF variants (unsloth/diffusiongemma-26B-A4B-it-GGUF) are also available under compatible open terms. Apache 2.0 permits modification, redistribution, and commercial deployment without fees. VRAM requirements vary by precision: NVFP4 runs within 18 GB, FP8 needs approximately 28 GB, and BF16 requires 50 GB or more, necessitating multi-GPU setups. This contrasts with GPT-5 or Gemini models, which are API-only proprietary models with no weight access, fine-tuning support, or self-hosting option.

What modalities does DiffusionGemma 26B-A4B support?

DiffusionGemma 26B-A4B accepts text, images, and video as inputs and produces text as output. Inputs can be mixed within a single prompt: images, video clips, and text can all be provided together for context-heavy multimodal reasoning tasks. Native function calling is fully supported, enabling structured tool use and agentic workflows without prompt engineering workarounds. A configurable thinking mode allows the model to show step-by-step reasoning before its final answer, similar to extended-thinking modes in Claude or Gemini. Audio input and audio output are not supported. Confirmed live capabilities include OCR, PDF parsing, chart comprehension, UI and screen parsing, video content analysis, conversational AI, code generation, and multilingual inference across 35 or more languages. Function calling uses a structured schema compatible with the OpenAI-compatible API provided by vLLM and NVIDIA NIM, so existing agentic frameworks integrate without modification.

Does Google train on DiffusionGemma user data?

DiffusionGemma 26B-A4B is an open-weights model: when you self-host it, Google receives no data from your inference runs, so there is no Google training-on-user-data concern for self-hosted deployments. When accessed via NVIDIA NIM's hosted endpoint, NVIDIA's data handling policies apply (not Google's), and NIM's commercial terms generally prohibit using inference inputs for model training. On Google Cloud Vertex AI Model Garden, Google's standard Vertex AI data handling applies: inputs are not used for model training by default, and enterprise customers can configure additional data isolation. No SOC 2 Type II, HIPAA-eligible, or ISO 27001 certification has been specifically published for DiffusionGemma itself; since Apache 2.0 puts compliance responsibility on the operator, organizations processing regulated data should assess their self-hosted deployment independently. The model's own training data had a cutoff of January 2025, covering web documents, code, mathematics, and images across 140 or more languages.

Who is DiffusionGemma 26B-A4B best for and who should avoid it?

DiffusionGemma 26B-A4B is best for developers and researchers who need very high single-request throughput from an open-source model and are willing to trade 5-19 benchmark points versus the autoregressive Gemma 4 26B-A4B. Ideal use cases include interactive local applications requiring 700+ tok/s on consumer GPUs, document parsing and OCR pipelines that benefit from bidirectional attention (as evidenced by OmniDocBench 1.5 results), and research into discrete diffusion approaches for text generation. Teams running agentic workflows on open-source models who want Apache 2.0 freedom and zero per-token costs also fit this profile well. Teams that should avoid DiffusionGemma: any application where benchmark accuracy maps directly to output quality (coding assistants, scientific reasoning, competition math), since Gemma 4 26B-A4B beats it by 5-19 points. High-concurrency API services should test batch throughput carefully, as autoregressive models close the gap via request batching above 10-16 concurrent requests. Real-time voice or audio applications are a poor fit since the model does not support audio input or output.

Visit DiffusionGemma 26B-A4B Official Page