Gemini 3.1 Flash-Lite: $0.25/M Tokens & 1M Context (2026)

Gemini 3.1 Flash-Lite (Google, March 2026) costs $0.25/$1.50 per 1M tokens with 1M context, 86.9% GPQA Diamond, and a 1432 LMArena Elo for high-volume tasks.

Gemini 3.1 Flash-Lite is Google DeepMind's most cost-efficient Gemini 3 model, released March 3, 2026 with a 1,048,576-token context window, 86.9% GPQA Diamond, and a 1432 LMArena Elo. Priced at $0.25 per 1M input tokens and $1.50 per 1M output tokens (about 3.4x cheaper than Claude Haiku 4.5), it runs at roughly 277 tokens per second and adds four thinking levels for tuning cost against reasoning depth.

Gemini 3.1 Flash-Lite, released by Google DeepMind on March 3, 2026, is the cheapest model in the Gemini 3 line, scoring 86.9% on GPQA Diamond and 1432 on LMArena Elo with a 1M-token context window. It costs $0.25 per 1M input tokens and $1.50 per 1M output tokens, runs at roughly 277 tokens per second, and is built for high-volume classification, translation, and lightweight agent tasks rather than frontier coding or math.

Provider: Google DeepMind · Family: Gemini 3.1

Context window: 1,048,576 tokens · Max output: 65,536

Input modalities: text, image, audio, video, pdf · Output: text, tool-calls

About Gemini 3.1 Flash-Lite

Gemini 3.1 Flash-Lite is Google DeepMind's most cost-efficient model in the Gemini 3 family, first previewed on March 3, 2026 and promoted to general availability on May 25, 2026 after the preview model ID, gemini-3.1-flash-lite-preview, was retired. It sits below Gemini 3.1 Flash and Gemini 3.1 Pro in Google's lineup, built as a sparse mixture-of-experts transformer in the Gemini 3 architecture line and optimized for high-volume, latency-sensitive traffic such as classification, translation, retrieval, and lightweight agent steps. It replaces Gemini 2.5 Flash-Lite, which Google is shutting down on July 22, 2026. On Google's published evaluation suite, Flash-Lite scores 86.9% on GPQA Diamond, 83% on MMLU-Pro, and 76.8% on MMMU Pro for multimodal understanding, with an Artificial Analysis Intelligence Index of 34 and a Chatbot Arena Elo of 1432. Reasoning-heavy benchmarks lag the rest of the lineup: it manages 16.7% on AIME 2025 and 8.5% on Humanity's Last Exam, both well below Gemini 3.1 Pro and Gemini 3.5 Flash. Against same-tier rivals, Google reports wins on 6 of 11 published benchmarks versus GPT-5 mini and Claude Haiku 4.5. Claude Haiku 4.5's own SWE-bench Verified score of 73.3% has no published Flash-Lite counterpart, which signals Google isn't positioning this model for autonomous coding. The model accepts up to 1,048,576 tokens of input (1M) and can generate up to 65,536 output tokens in a single response. That context window is 5x larger than Claude Haiku 4.5's 200K limit, though Google has not published a needle-in-haystack recall figure for Flash-Lite specifically beyond its model card's general Gemini 3 claims, so recall above roughly 500K tokens should be treated as good rather than frontier-verified for this smaller variant. Flash-Lite accepts text, image, audio, video, and PDF input and returns text plus tool-call output. It supports function calling, structured JSON output, and four thinking levels (minimal, low, medium, high) that let developers trade reasoning depth for latency and spend on a per-request basis. It does not produce native audio or image output; Google ships those as separate models, Gemini 3.1 Flash Audio and Gemini 3.1 Flash Image. Pricing is $0.25 per 1M input tokens and $1.50 per 1M output tokens, with thinking tokens billed at the output rate. That is 40% cheaper on output than Gemini 2.5 Flash ($0.30/$2.50) despite scoring higher across reasoning and multimodal benchmarks, and roughly 3.4x cheaper than Claude Haiku 4.5 on a blended basis. A real workload of 1,000 leads per day with 400-token average responses costs about $0.62/day on Flash-Lite versus $1.02/day on Gemini 2.5 Flash, a saving of roughly $146 per year for one medium automation. Context caching can cut repeated-prompt costs by up to 90% for workloads that reuse the same system prompt or document context. Flash-Lite is available directly through the Gemini API with API key authentication, through Vertex AI with GCP IAM, and through the Gemini Enterprise Agent Platform. Google has not listed it on AWS Bedrock or Azure AI Foundry as of June 2026. As a closed, proprietary model there are no downloadable weights, quantization options, or self-hosting paths. The model's training data has a knowledge cutoff of January 2025. Google's model card states that, based on Gemini 3.1 Pro's capability assessments, Flash-Lite is unlikely to reach any Critical Capability Level, and that it improves on Gemini 2.5 Flash-Lite for both safety and tone while keeping unjustified refusals low. Output speed measured by Artificial Analysis is about 277 tokens per second on Google's own API, though time-to-first-token runs around 5.2 seconds, slower than the roughly 2-second median for similarly priced models, likely reflecting the model's default thinking behavior. Flash-Lite is a strong fit for high-volume classification and tagging pipelines, translation at scale, RAG retrieval and reranking, lightweight customer support bots, and agent steps that call tools but don't require deep multi-step planning. Teams building competition-math solvers, research-grade reasoning agents, or coding agents that need SWE-bench-class performance should look at Gemini 3.1 Pro, Gemini 3.5 Flash, or Claude Opus 4.5 instead, all of which publish stronger AIME 2025 and SWE-bench Verified numbers. Voice-first products needing sub-second responses should use Gemini 3.1 Flash Live rather than Flash-Lite's text endpoint. On the release timeline, gemini-3.1-flash-lite-preview launched March 3, 2026, was marked deprecated on May 11, 2026, and was shut down on May 25, 2026 in favor of the GA gemini-3.1-flash-lite model ID, which is the version teams should pin to going forward. Data handling on the Gemini API follows Google's standard API terms, with enterprise residency and retention controls available through Vertex AI; full details are in the official model card.

Pricing

$0.25 per 1M input tokens, $1.50 per 1M output tokens via the Gemini API and Vertex AI. Thinking tokens are billed at the output rate. Context caching can cut repeated-prompt input costs by up to 90%, though Google has not published an exact cached-token rate for this model.

Key Features

Pros

Cons

Benchmarks

Frequently Asked Questions

What is Gemini 3.1 Flash-Lite and who built it?

Gemini 3.1 Flash-Lite is Google DeepMind's most cost-efficient model in the Gemini 3 family, first previewed on March 3, 2026 and made generally available on May 25, 2026. It is built as a sparse mixture-of-experts transformer in the Gemini 3 architecture line, with Google not disclosing an exact parameter count. On Google's published evaluation suite it scores 86.9% on GPQA Diamond, 83% on MMLU-Pro, and 76.8% on MMMU Pro for multimodal understanding, with a Chatbot Arena Elo of 1432. It was designed to replace Gemini 2.5 Flash-Lite, which Google is retiring on July 22, 2026, and to compete directly with GPT-5 mini and Claude Haiku 4.5 in the fast, cheap tier. Google reports it beats both rivals on 6 of 11 published benchmarks. It sits below Gemini 3.1 Flash and Gemini 3.1 Pro in Google's lineup, aimed at high-volume, latency-sensitive workloads rather than frontier reasoning.

How much does Gemini 3.1 Flash-Lite cost per 1M tokens?

Gemini 3.1 Flash-Lite costs $0.25 per 1M input tokens and $1.50 per 1M output tokens through the Gemini API and Vertex AI, with thinking tokens billed at the output rate. That is 40% cheaper on output than its predecessor, Gemini 2.5 Flash, at $0.30/$2.50, despite scoring higher on most benchmarks. It is roughly 3.4x cheaper than Claude Haiku 4.5 on a blended basis. A real workload of 1,000 leads per day with 400-token average responses costs about $0.62/day on Flash-Lite versus $1.02/day on Gemini 2.5 Flash, saving roughly $146 per year for one medium automation. Summarizing a 100K-token document with a 1K-token output costs about $0.0265. Context caching can cut repeated-prompt input costs by up to 90%, though Google has not published an exact cached-token rate for this model. As a closed model, there is no self-hosting option.

What is Gemini 3.1 Flash-Lite's context window and max output?

Gemini 3.1 Flash-Lite has a context window of 1,048,576 tokens (1M) and a maximum output of 65,536 tokens per response. That context window is 5x larger than Claude Haiku 4.5's 200K-token limit and matches the 1M-token window Google ships across the Gemini 3 line. Google's model card describes general Gemini 3 long-context behavior but has not published a needle-in-haystack recall figure specifically for the Flash-Lite variant, so recall above roughly 500K tokens should be treated as good rather than independently verified for this smaller model. Document handling supports PDFs and multi-file inputs as part of its multimodal input set. For workloads that need verified frontier-level recall at very long depths, Gemini 3.1 Pro or Gemini 3.5 Flash are safer choices.

How does Gemini 3.1 Flash-Lite compare on benchmarks vs Claude Haiku 4.5 and GPT-5 mini?

Google reports that Gemini 3.1 Flash-Lite beats GPT-5 mini and Claude Haiku 4.5 on 6 of 11 benchmarks it published at launch, with a headline 86.9% on GPQA Diamond and a Chatbot Arena Elo of 1432. Claude Haiku 4.5 separately reports 73.3% on SWE-bench Verified, a benchmark Google has not published a Flash-Lite score for, suggesting Flash-Lite is not positioned as a coding-first model the way Haiku 4.5 is. On math and expert reasoning, Flash-Lite trails badly: 16.7% on AIME 2025 and 8.5% on Humanity's Last Exam, both far below Gemini 3.1 Pro and Gemini 3.5 Flash. In practice, a 10+ point benchmark gap on agentic coding or math tasks means Flash-Lite will need more retries or human review on those workloads, even though it wins decisively on cost and context window size.

Is Gemini 3.1 Flash-Lite open source or proprietary?

Gemini 3.1 Flash-Lite is fully proprietary and API-only; Google has not released any weights, and there is no open-weights or open-source variant. It is accessible through the direct Gemini API with API key authentication, through Google Vertex AI with GCP IAM authentication, and through the Gemini Enterprise Agent Platform. As of June 2026 it is not listed on AWS Bedrock or Azure AI Foundry, unlike some competing models from Anthropic and OpenAI that ship on multiple clouds. There are no quantization options, VRAM requirements, or self-hosting paths since the model cannot be downloaded. Commercial use is governed by Google's Generative AI Additional Terms of Service for the Gemini API.

What modalities does Gemini 3.1 Flash-Lite support?

Gemini 3.1 Flash-Lite accepts text, image, audio, video, and PDF as input, and returns text plus tool-call output. It supports native function calling and structured JSON output, making it suitable for agent pipelines that need predictable, parseable responses. It also exposes four thinking levels (minimal, low, medium, high) that control how much internal reasoning the model performs before responding, trading latency and cost for answer quality. It does not generate audio or image output natively; those capabilities are handled by separate Gemini 3.1 Flash Audio and Gemini 3.1 Flash Image models. Compared to Claude Haiku 4.5, Flash-Lite's input modality list is broader (adding audio and video), but neither model produces non-text output directly.

Does Gemini 3.1 Flash-Lite train on user data?

Gemini 3.1 Flash-Lite follows Google's standard Gemini API data handling under the Generative AI Additional Terms of Service, and Google states it does not train on data submitted through the paid API by default. Vertex AI offers enterprise-grade data residency and retention controls for customers with stricter compliance needs, with data residency options across US, EU, and Asia regions. Google's official model card and trust documentation provide the authoritative details on retention windows and certification status; specific SOC 2, ISO 27001, HIPAA, and GDPR attestations for this exact model variant were not independently confirmed during research and should be checked against Google Cloud's current compliance pages before relying on them for regulated workloads. The model's training data has a knowledge cutoff of January 2025.

Who is Gemini 3.1 Flash-Lite best for and who should avoid it?

Gemini 3.1 Flash-Lite is best for teams running high-volume classification, tagging, or translation pipelines, RAG retrieval over long documents thanks to its 1M-token context, and lightweight tool-using agents or support bots where cost per request matters most. It is also a strong, cheaper migration target for anyone still on Gemini 2.5 Flash-Lite, which Google retires on July 22, 2026. Teams should avoid it for autonomous coding agents, since no SWE-bench Verified score has been published and Gemini 3.1 Pro or Claude Opus 4.5 perform far better on agentic coding. It is also a poor fit for competition math or research-grade reasoning (16.7% AIME 2025, 8.5% Humanity's Last Exam) and for real-time voice assistants, where its roughly 5.2-second time-to-first-token is too slow; Gemini 3.1 Flash Live is the better choice there.

Visit Gemini 3.1 Flash-Lite Official Page