Grok 4.20 Review: 2M Context, 88.9% GPQA, $1.25/M (2026)

Grok 4.20 by xAI (March 2026): 2M context, 88.9% GPQA Diamond, 4-agent architecture. $1.25 input / $2.50 output per 1M tokens. Best for long-doc AI workflows.

Grok 4.20 is xAI's fourth-generation flagship, released in beta on February 17, 2026, featuring a 4-agent MoE architecture, 2M token context window, 88.9% GPQA Diamond, and 95% AIME 2025 accuracy. Priced at $1.25 per 1M input and $2.50 per 1M output tokens, it outputs 265 tokens per second and serves as xAI's long-context specialist, succeeding Grok 4.1.

Grok 4.20, released in beta on February 17, 2026, by xAI, is a 4-agent MoE-architecture model with a 2M token context window scoring 88.9% on GPQA Diamond. Priced at $1.25 input and $2.50 output per 1M tokens, it outputs 265 tokens per second, the fastest among frontier models at its April 2026 evaluation. Its 4-agent council reduces hallucinations by 65% versus single-model designs.

Provider: xAI · Family: Grok 4

Context window: 2,000,000 tokens

Input modalities: text, image, pdf, tool-calls · Output: text, tool-calls

About Grok 4.20

Grok 4.20 is xAI's fourth-generation flagship language model, released in public beta on February 17, 2026, with full general availability reaching the API on March 18, 2026. Built on a Mixture-of-Experts transformer backbone with an estimated 1.7 to 3 trillion total parameters, it is the most structurally distinct Grok release since the original series. Rather than routing queries to a single model instance, Grok 4.20 deploys a four-agent council at inference time: Grok (coordinator), Harper (research), Benjamin (math and code), and Lucas (synthesis and creativity). These agents run in parallel on shared weights, debate intermediate results through peer-review rounds, and synthesize a final answer. This design reduces hallucination rates from approximately 12% to 4.2% versus a single-model baseline, a 65% improvement documented in xAI internal testing. On benchmark performance, Grok 4.20 scores 88.9% on GPQA Diamond for graduate-level scientific reasoning, placing it among the top tier of proprietary models, though GPT-5.4 (92.8%) and Claude Opus 4.6 (91.3%) hold the advantage on this specific axis. AIME 2025 math performance reaches approximately 95% for the standard four-agent variant; the 16-agent Heavy mode achieves a reported perfect score. On MMLU-Pro, the model scores 86.6%, consistent with the frontier cluster. SWE-bench Verified software engineering performance sits at 75%, essentially tied with GPT-5.4 at 74.9% and marginally ahead of Claude Opus 4.6 at 74%. Gemini 3.1 Pro trails at 68.3%, a meaningful gap for production coding workflows. LM Arena crowdsourced Elo ranged between 1505 and 1535 at provisional March 2026 measurement. IFBench instruction-following accuracy is 82.9%. Grok 4.20 ships in three variants that differ primarily in context window and reasoning behavior. The multi-agent variant (grok-4.20-multi-agent-0309) supports a 2 million token context window, large enough to process entire multi-year codebases, full legal case archives, or extensive clinical trial datasets in a single API call. The reasoning and non-reasoning single-model variants (grok-4.20-0309-reasoning and grok-4.20-0309-non-reasoning) support a 1 million token context window. Output tokens count toward the declared context window rather than a separate cap. Community evaluations report strong long-context recall above 500K tokens, a marked improvement over Grok 4.1 which topped out at 131K and substantially ahead of GPT-5 at 128K. On modalities and capabilities, Grok 4.20 accepts text and image inputs natively, with PDF handling available through the X platform file processing pipeline. Outputs are text and tool-call responses. Native function calling and JSON-structured output are supported on the xAI API with full parameter-level schema definitions. Real-time data from the X platform is integrated via a live search hook, extending the effective knowledge boundary beyond the static November 2024 training cutoff for current-events queries. The Heavy variant scales to 16 parallel agents for the most demanding decomposable tasks. Native audio input and output are absent from the 4.20 release; xAI has indicated these capabilities are on the roadmap for Grok 5. As of May 2026, xAI standardized Grok 4.20 pricing at $1.25 per 1M input tokens, $2.50 per 1M output tokens, and $0.20 per 1M cached input tokens. This is a reduction from the original launch pricing of $2.00 input and $6.00 output per million tokens. To put these rates in context: processing a 500K-token research document costs approximately $0.63; a daily coding agent loop processing 1M input tokens and 200K output tokens costs $1.75; a 1,000-turn customer-support session at 2K input and 500 output tokens per turn costs $3.75. Claude Opus 4.8 at $5.00/$15.00 per 1M and GPT-5 at similar rates are materially more expensive on both input and output. Heavy mode (16 agents) multiplies total token consumption 8 to 12 times versus standard mode, so high-volume batch pipelines should use the single-model reasoning variant for cost efficiency. Grok 4.20 is available via the xAI API at api.x.ai with API key authentication. Third-party gateway access is available through OpenRouter and Fireworks AI. As of June 2026, the managed cloud launches of Grok to AWS Bedrock and Azure AI Catalog applied specifically to Grok 4.3; Grok 4.20 access on those platforms is routed via third-party gateways. The model is closed-weight and proprietary: it cannot be self-hosted, fine-tuned, or deployed in air-gapped environments. xAI's earlier models, Grok 1 (Apache 2.0) and Grok 2 (community license), were open-weight, but starting with Grok 3, xAI moved to closed weights for its flagship series. On safety and alignment, xAI applies a combination of reinforcement learning from human feedback and supervised fine-tuning with a stated design philosophy of being maximally truth-seeking. The published model cards covering Grok 4 and Grok 4.1 documented safety evaluations across abuse potential, concerning propensities, and dual-use capabilities. No separate model card was published for Grok 4.20 as of June 2026; xAI states safety methodology is continuous across the 4.x family. The model's default safety posture is configurable via system prompt, giving operators more control over refusal behavior than models with fixed alignment policies. Grok 4.20 is generally more permissive than Claude Opus 4.8 or GPT-5 on edge-case requests by default, which suits some use cases and creates risk for others. Grok 4.20 is best suited for teams working with document sets exceeding 200K tokens, where its 2M context window is a hard competitive advantage. Its 88.9% GPQA Diamond score makes it a strong choice for scientific reasoning pipelines in chemistry, biology, and physics. The 4-agent architecture benefits long-form content generation and multi-step research tasks requiring internal fact-checking. At 265 tokens per second output speed, it serves latency-tolerant batch analysis well. Teams that should look elsewhere include those building real-time voice applications (no audio I/O, 15-second time-to-first-token on reasoning variants), organizations requiring on-premise or air-gapped deployment, and enterprise teams needing strict content controls where GPT-5 or Claude Opus 4.8 are safer defaults. The static training data cutoff is November 2024. xAI uses a mix of public web text, licensed datasets, and synthetic reasoning traces; a detailed training data breakdown has not been publicly disclosed. Real-time X platform data extends the effective knowledge boundary for current-events queries. API inputs are not used for model training by default; a zero-retention enterprise option is available on request. Default data retention is 30 days for safety monitoring, then deleted unless flagged. SOC 2 Type II certification status for the Grok 4.20 API tier had not been independently confirmed as of June 2026. EU AI Act classification applies under the general-purpose AI with systemic risk obligations category. Grok 4.20 Beta launched February 17, 2026. Beta 2 on March 3, 2026, delivered five targeted fixes covering instruction following, hallucination reduction, LaTeX rendering, image search accuracy, and multi-image support. The reasoning API variant launched March 10, 2026. Full GA rollout followed March 18, 2026. Grok 4.3 became the new flagship on April 30, 2026, with Grok 4.20 remaining in service as the long-context-specialized option. Grok 5, targeting native audio and video multimodality, is on the public roadmap.

Pricing

$1.25 per 1M input tokens, $2.50 per 1M output tokens as of May 2026, reduced from $2.00/$6.00 at launch. Cached input at $0.20 per 1M (84% discount). Heavy mode (16-agent) multiplies token usage 8-12x at same per-token rate.

Key Features

Pros

Cons

Benchmarks

Frequently Asked Questions

What is Grok 4.20 and who built it?

Grok 4.20 is a frontier large language model developed by xAI, the AI company founded by Elon Musk in 2023 and currently operating as a SpaceX division. It launched in public beta on February 17, 2026, reached full general availability on March 18, 2026, and represents the fourth major architecture generation in the Grok series. The model is built on a Mixture-of-Experts transformer backbone with an estimated 1.7 to 3 trillion total parameters, making it one of the largest models at inference-time token efficiency. Its defining architectural feature is a four-agent inference council: Grok (coordinator), Harper (research), Benjamin (math and code), and Lucas (synthesis and creativity), all running in parallel on shared weights. This multi-agent design reduces hallucination rates from approximately 12% to 4.2% versus a single-model baseline, a 65% improvement documented in xAI internal testing. On GPQA Diamond, it scores 88.9%, and on AIME 2025 it reaches approximately 95%. It sits between the retired Grok 4.1 and the newer Grok 4.3 in xAI's lineup, continuing to serve as the long-context-specialized option with a 2M token window.

How much does Grok 4.20 cost per 1M tokens?

As of May 2026, Grok 4.20 costs $1.25 per 1 million input tokens and $2.50 per 1 million output tokens across all three variants: non-reasoning, reasoning, and multi-agent. Cached input tokens are priced at $0.20 per 1 million tokens, an 84% discount versus uncached input that benefits long-context workloads with repeated system prompts. This pricing represents a significant reduction from the original February 2026 launch rates of $2.00 input and $6.00 output per million tokens. For a practical workload, processing a 500K-token document costs $0.63; a daily coding agent loop at 1M input and 200K output costs $1.75; and a 1,000-turn customer-support deployment at 2K input and 500 output tokens per turn costs $3.75. Heavy mode (16 agents) multiplies total token consumption 8 to 12 times as each agent generates its own intermediate chain-of-thought, so per-task costs in Heavy mode can be substantially higher than standard mode. Compared to Claude Opus 4.8 ($5.00/$15.00 per 1M) or GPT-5 at similar rates, Grok 4.20 is materially cheaper at current pricing.

What is Grok 4.20's context window and max output?

The Grok 4.20 multi-agent variant (grok-4.20-multi-agent-0309) supports a 2 million token context window, the largest of any flagship model at its February 2026 release. The single-model reasoning and non-reasoning variants each support a 1 million token context window. Output tokens count toward the declared context window rather than an independent cap, so users processing a 1.9M-token document can still generate meaningful output before reaching the ceiling. Community evaluations report strong long-context recall above 500K tokens, a marked improvement over Grok 4.1 which capped at 131K. For context, Claude Opus 4.8 offers 1M context, GPT-5 offers 128K, and Gemini 3.1 Pro matches at 2M; Grok 4.20's multi-agent variant is tied for the lead on raw window size. PDF and multi-document inputs are handled through the X platform file processing pipeline. Sliding window or KV cache truncation behavior has not been officially documented; xAI recommends placing critical context within the first and final portions of the window for best recall performance.

How does Grok 4.20 compare on benchmarks vs GPT-5 and Claude Opus 4.6?

On GPQA Diamond, Grok 4.20 scores 88.9% versus GPT-5.4 at approximately 92.8% and Claude Opus 4.6 at approximately 91.3%, placing it third on graduate-level scientific reasoning. On SWE-bench Verified for agentic software engineering, Grok 4.20 reaches 75%, essentially tied with GPT-5.4 at 74.9% and slightly ahead of Claude Opus 4.6 at 74%; the meaningful gap is to Gemini 3.1 Pro at 68.3%. On AIME 2025 math, Grok 4.20 scores approximately 95% in standard mode, while GPT-5 and Claude are in the 90 to 92% range. On MMLU-Pro, all three models cluster in the 85 to 92% range, with Grok 4.20 at 86.6%. The LM Arena crowdsourced Elo for Grok 4.20 ranged between 1505 and 1535, competitive with the top-tier cluster. Grok 4.20's unambiguous advantage over GPT-5 (128K) is context window: 2M versus 128K is a 15-times difference that matters directly for full-codebase and full-archive workflows. Grok 4.20 does not publish an ARC-AGI 2 score, which is notable given the benchmark's focus on general fluid intelligence.

Is Grok 4.20 open source or proprietary?

Grok 4.20 is fully proprietary and closed-weight. Weights are not available for download, and the model cannot be self-hosted, fine-tuned, or run in air-gapped environments. This contrasts with earlier xAI releases: Grok 1 (314B parameters) was open-sourced under Apache 2.0 and remains available on GitHub at github.com/xai-org/grok-1, and Grok 2 was released under a community license permitting commercial use. Starting with the Grok 3 series, xAI moved to closed weights for its flagship lineup. API access requires an xAI API key via api.x.ai. Third-party gateway access is available through OpenRouter (confirmed Grok 4.20 beta listing) and Fireworks AI. As of June 2026, Grok 4.20 has not been confirmed on AWS Bedrock or Azure AI Catalog; those managed cloud launches applied to Grok 4.3. Commercial use is permitted under xAI's API terms of service.

What modalities does Grok 4.20 support?

Grok 4.20 accepts text and image inputs natively, with PDF processing available through the X platform file pipeline. On the output side, it produces text and tool-call responses. Native function calling and JSON-structured output are supported via the xAI API with full parameter-level schema definitions, enabling integration with external tools, databases, and code execution environments. The four-agent standard variant runs parallel agents at inference time; Heavy mode scales this to 16 agents for complex decomposable tasks. Real-time X platform data integration is available via a live search hook, extending knowledge beyond the November 2024 static training cutoff for current-events queries. Audio input and output are not supported in Grok 4.20; xAI has stated natively multimodal audio and video capabilities are planned for Grok 5, which is on the public roadmap. Video input is similarly absent from the current release.

Does Grok 4.20 train on user data?

xAI does not train on paid API inputs by default; a zero-retention enterprise option is available on request. Default API data retention is 30 days for safety monitoring and abuse detection, then deleted unless flagged. Users accessing Grok via the X platform or grok.com are subject to X's data retention policy, which is separate from API terms and may include use for model improvement unless opted out via account settings. SOC 2 Type II certification for the Grok 4.20 API tier had not been independently confirmed as of June 2026; xAI has referenced SOC 2 pursuit in enterprise communications without a confirmed attestation date. HIPAA-eligible processing and formal GDPR documentation are discussed for enterprise accounts; confirm specifics with xAI's enterprise sales team. The EU AI Act classifies Grok 4.20 under the general-purpose AI with systemic risk obligations category. Data residency for the API currently routes through US-based infrastructure; EU data residency options are not publicly confirmed.

Who is Grok 4.20 best for and who should avoid it?

Grok 4.20 is best suited for teams processing documents above 200K tokens: legal, scientific, and financial research where the 2M context window eliminates chunking overhead. Its 88.9% GPQA Diamond score makes it appropriate for scientific reasoning pipelines including chemistry, biology, and physics question-answering at the graduate level. The 4-agent architecture with internal peer-review benefits long-form content generation and multi-step research tasks where hallucination reduction is critical. At $1.25/$2.50 per 1M tokens, it suits cost-sensitive teams migrating from Claude Opus 4.8 or GPT-5 at higher price points. Teams that should avoid it include those building real-time voice applications (no audio I/O, 15-second time-to-first-token on reasoning variants rules out conversational interfaces), organizations requiring on-premise or air-gapped deployment (closed weights only), and strict-content-filter enterprises where GPT-5 or Claude Opus 4.8's tighter default alignment is required. The Heavy variant's token multiplication also makes it a poor fit for high-volume short-task pipelines on a tight per-token budget.

Visit Grok 4.20 Official Page