Name: DiffusionGemma 26B: 1,100 tok/s Open Diffusion LLM (2026)
Brand: Google DeepMind
Availability: InStock

Question 1

What is DiffusionGemma 26B-A4B and who built it?

Accepted Answer

DiffusionGemma 26B-A4B is Google DeepMind's first discrete text diffusion language model, released publicly on June 10, 2026 under an Apache 2.0 license. It is built on the Gemma 4 26B-A4B Mixture-of-Experts backbone, using 25.2 billion total parameters and 3.8 billion active parameters per token, with 128 fine-grained experts and top-8 routing. Instead of generating one token at a time like standard autoregressive models, it denoises parallel 256-token canvases over up to 48 iterative steps, producing all 256 tokens simultaneously per forward pass. The model supports text, image, and video inputs with text output, native function calling, configurable thinking mode, and 35+ language multilingual inference. On benchmarks, it scores 77.6% on MMLU Pro, 73.2% on GPQA Diamond, and 69.1% on AIME 2026, trading 5-19 points vs the autoregressive Gemma 4 26B-A4B in exchange for 4x higher throughput at low concurrency. It sits in Google DeepMind's experimental Gemma family and is aimed at developers who prioritize speed and open-source freedom over maximum accuracy.

Question 2

How much does DiffusionGemma 26B-A4B cost?

Accepted Answer

DiffusionGemma 26B-A4B is fully open-source under Apache 2.0, so there is no per-token charge from Google: you pay only for the compute you run it on. Weights are available free of charge on Hugging Face (google/diffusiongemma-26B-A4B-it), Kaggle, and Google Cloud Vertex AI Model Garden with no download fees or royalties. NVIDIA hosts the model at no cost via NVIDIA NIM and NVIDIA Build (capacity-limited preview as of June 2026), which is the fastest way to test the model without owning a GPU. For self-hosted inference, the NVFP4-quantized build fits within 18 GB VRAM; an RTX 4090 rented at roughly $0.50/hr serves 700+ tok/s for interactive workloads. An H100 at roughly $2.50 per hour in FP8 reaches 1,008 tok/s for production use, putting the effective cost at approximately $0.70 per 1 million tokens. The Apache 2.0 license permits commercial use with no attribution requirement and no usage caps, so total cost of ownership is purely infrastructure-driven.

Question 3

What is DiffusionGemma 26B-A4B's context window and max output?

Accepted Answer

DiffusionGemma 26B-A4B supports a 256,000-token context window, matching its autoregressive sibling Gemma 4 26B-A4B in context length. The model uses an encoder-decoder design: the encoder processes the full prompt with causal attention to populate a KV cache, while the decoder uses bidirectional attention over each 256-token generation canvas. Outputs are produced one 256-token block at a time, meaning very long outputs require sequential multi-block generation. A separate max output token limit has not been officially published as of June 2026. Independent needle-in-haystack or long-context recall benchmarks above 100K tokens have not been released for DiffusionGemma, so performance in the upper half of the context window is unverified by third parties. For PDF parsing and document extraction at moderate lengths, the bidirectional decoder attention gives DiffusionGemma a measurable advantage over causal models, as confirmed by OmniDocBench 1.5 results. Gemma 4 26B-A4B has a more established track record for multi-hundred-thousand-token tasks due to published long-context recall evals.

Question 4

How does DiffusionGemma 26B-A4B compare to Gemma 4 26B-A4B on benchmarks?

Accepted Answer

DiffusionGemma 26B-A4B trails the autoregressive Gemma 4 26B-A4B on every general knowledge and reasoning benchmark by a margin of 5 to 19 points. On MMLU Pro, DiffusionGemma scores 77.6% versus Gemma 4's 82.6%, a 5-point gap. On GPQA Diamond, the gap widens to 9 points (73.2% vs 82.3%). On AIME 2026, the largest gap appears: 69.1% vs 88.3%, a 19-point difference that makes DiffusionGemma a poor choice for competition math. LiveCodeBench v6 shows 69.1% for DiffusionGemma; Gemma 4 comparative data was not available at time of writing. The one benchmark where DiffusionGemma wins is OmniDocBench 1.5, where its bidirectional decoder attention outperforms causal attention on OCR and layout-aware document extraction. The tradeoff is explicit: DiffusionGemma sacrifices 5-19 benchmark points in exchange for roughly 4x throughput on the same hardware at low batch sizes. Teams where benchmark scores translate directly to output quality should stay with the autoregressive Gemma 4 variant.

Question 5

Is DiffusionGemma 26B-A4B open source?

Accepted Answer

Yes, DiffusionGemma 26B-A4B is fully open-source under the Apache 2.0 license, one of the most permissive open-source licenses available. Weights are freely downloadable from Hugging Face (google/diffusiongemma-26B-A4B-it), Kaggle, and Google Cloud Vertex AI Model Garden with no royalties, no attribution requirements for commercial use, and no usage restrictions beyond the Apache 2.0 terms. The NVIDIA NVFP4 quantized variant (nvidia/diffusiongemma-26B-A4B-it-NVFP4) and Unsloth GGUF variants (unsloth/diffusiongemma-26B-A4B-it-GGUF) are also available under compatible open terms. Apache 2.0 permits modification, redistribution, and commercial deployment without fees. VRAM requirements vary by precision: NVFP4 runs within 18 GB, FP8 needs approximately 28 GB, and BF16 requires 50 GB or more, necessitating multi-GPU setups. This contrasts with GPT-5 or Gemini models, which are API-only proprietary models with no weight access, fine-tuning support, or self-hosting option.

Question 6

What modalities does DiffusionGemma 26B-A4B support?

Accepted Answer

DiffusionGemma 26B-A4B accepts text, images, and video as inputs and produces text as output. Inputs can be mixed within a single prompt: images, video clips, and text can all be provided together for context-heavy multimodal reasoning tasks. Native function calling is fully supported, enabling structured tool use and agentic workflows without prompt engineering workarounds. A configurable thinking mode allows the model to show step-by-step reasoning before its final answer, similar to extended-thinking modes in Claude or Gemini. Audio input and audio output are not supported. Confirmed live capabilities include OCR, PDF parsing, chart comprehension, UI and screen parsing, video content analysis, conversational AI, code generation, and multilingual inference across 35 or more languages. Function calling uses a structured schema compatible with the OpenAI-compatible API provided by vLLM and NVIDIA NIM, so existing agentic frameworks integrate without modification.

Question 7

Does Google train on DiffusionGemma user data?

Accepted Answer

DiffusionGemma 26B-A4B is an open-weights model: when you self-host it, Google receives no data from your inference runs, so there is no Google training-on-user-data concern for self-hosted deployments. When accessed via NVIDIA NIM's hosted endpoint, NVIDIA's data handling policies apply (not Google's), and NIM's commercial terms generally prohibit using inference inputs for model training. On Google Cloud Vertex AI Model Garden, Google's standard Vertex AI data handling applies: inputs are not used for model training by default, and enterprise customers can configure additional data isolation. No SOC 2 Type II, HIPAA-eligible, or ISO 27001 certification has been specifically published for DiffusionGemma itself; since Apache 2.0 puts compliance responsibility on the operator, organizations processing regulated data should assess their self-hosted deployment independently. The model's own training data had a cutoff of January 2025, covering web documents, code, mathematics, and images across 140 or more languages.

Question 8

Who is DiffusionGemma 26B-A4B best for and who should avoid it?

Accepted Answer

DiffusionGemma 26B-A4B is best for developers and researchers who need very high single-request throughput from an open-source model and are willing to trade 5-19 benchmark points versus the autoregressive Gemma 4 26B-A4B. Ideal use cases include interactive local applications requiring 700+ tok/s on consumer GPUs, document parsing and OCR pipelines that benefit from bidirectional attention (as evidenced by OmniDocBench 1.5 results), and research into discrete diffusion approaches for text generation. Teams running agentic workflows on open-source models who want Apache 2.0 freedom and zero per-token costs also fit this profile well. Teams that should avoid DiffusionGemma: any application where benchmark accuracy maps directly to output quality (coding assistants, scientific reasoning, competition math), since Gemma 4 26B-A4B beats it by 5-19 points. High-concurrency API services should test batch throughput carefully, as autoregressive models close the gap via request batching above 10-16 concurrent requests. Real-time voice or audio applications are a poor fit since the model does not support audio input or output.

DiffusionGemma 26B: 1,100 tok/s Open Diffusion LLM (2026)

About DiffusionGemma 26B-A4B

Pricing

Key Features

Pros

Cons

Benchmarks

Frequently Asked Questions