Question 1

What is Grok 4.20 and who built it?

Accepted Answer

Grok 4.20 is a frontier large language model developed by xAI, the AI company founded by Elon Musk in 2023 and currently operating as a SpaceX division. It launched in public beta on February 17, 2026, reached full general availability on March 18, 2026, and represents the fourth major architecture generation in the Grok series. The model is built on a Mixture-of-Experts transformer backbone with an estimated 1.7 to 3 trillion total parameters, making it one of the largest models at inference-time token efficiency. Its defining architectural feature is a four-agent inference council: Grok (coordinator), Harper (research), Benjamin (math and code), and Lucas (synthesis and creativity), all running in parallel on shared weights. This multi-agent design reduces hallucination rates from approximately 12% to 4.2% versus a single-model baseline, a 65% improvement documented in xAI internal testing. On GPQA Diamond, it scores 88.9%, and on AIME 2025 it reaches approximately 95%. It sits between the retired Grok 4.1 and the newer Grok 4.3 in xAI's lineup, continuing to serve as the long-context-specialized option with a 2M token window.

Question 2

How much does Grok 4.20 cost per 1M tokens?

Accepted Answer

As of May 2026, Grok 4.20 costs $1.25 per 1 million input tokens and $2.50 per 1 million output tokens across all three variants: non-reasoning, reasoning, and multi-agent. Cached input tokens are priced at $0.20 per 1 million tokens, an 84% discount versus uncached input that benefits long-context workloads with repeated system prompts. This pricing represents a significant reduction from the original February 2026 launch rates of $2.00 input and $6.00 output per million tokens. For a practical workload, processing a 500K-token document costs $0.63; a daily coding agent loop at 1M input and 200K output costs $1.75; and a 1,000-turn customer-support deployment at 2K input and 500 output tokens per turn costs $3.75. Heavy mode (16 agents) multiplies total token consumption 8 to 12 times as each agent generates its own intermediate chain-of-thought, so per-task costs in Heavy mode can be substantially higher than standard mode. Compared to Claude Opus 4.8 ($5.00/$15.00 per 1M) or GPT-5 at similar rates, Grok 4.20 is materially cheaper at current pricing.

Question 3

What is Grok 4.20's context window and max output?

Accepted Answer

The Grok 4.20 multi-agent variant (grok-4.20-multi-agent-0309) supports a 2 million token context window, the largest of any flagship model at its February 2026 release. The single-model reasoning and non-reasoning variants each support a 1 million token context window. Output tokens count toward the declared context window rather than an independent cap, so users processing a 1.9M-token document can still generate meaningful output before reaching the ceiling. Community evaluations report strong long-context recall above 500K tokens, a marked improvement over Grok 4.1 which capped at 131K. For context, Claude Opus 4.8 offers 1M context, GPT-5 offers 128K, and Gemini 3.1 Pro matches at 2M; Grok 4.20's multi-agent variant is tied for the lead on raw window size. PDF and multi-document inputs are handled through the X platform file processing pipeline. Sliding window or KV cache truncation behavior has not been officially documented; xAI recommends placing critical context within the first and final portions of the window for best recall performance.

Question 4

How does Grok 4.20 compare on benchmarks vs GPT-5 and Claude Opus 4.6?

Accepted Answer

On GPQA Diamond, Grok 4.20 scores 88.9% versus GPT-5.4 at approximately 92.8% and Claude Opus 4.6 at approximately 91.3%, placing it third on graduate-level scientific reasoning. On SWE-bench Verified for agentic software engineering, Grok 4.20 reaches 75%, essentially tied with GPT-5.4 at 74.9% and slightly ahead of Claude Opus 4.6 at 74%; the meaningful gap is to Gemini 3.1 Pro at 68.3%. On AIME 2025 math, Grok 4.20 scores approximately 95% in standard mode, while GPT-5 and Claude are in the 90 to 92% range. On MMLU-Pro, all three models cluster in the 85 to 92% range, with Grok 4.20 at 86.6%. The LM Arena crowdsourced Elo for Grok 4.20 ranged between 1505 and 1535, competitive with the top-tier cluster. Grok 4.20's unambiguous advantage over GPT-5 (128K) is context window: 2M versus 128K is a 15-times difference that matters directly for full-codebase and full-archive workflows. Grok 4.20 does not publish an ARC-AGI 2 score, which is notable given the benchmark's focus on general fluid intelligence.

Question 5

Is Grok 4.20 open source or proprietary?

Accepted Answer

Grok 4.20 is fully proprietary and closed-weight. Weights are not available for download, and the model cannot be self-hosted, fine-tuned, or run in air-gapped environments. This contrasts with earlier xAI releases: Grok 1 (314B parameters) was open-sourced under Apache 2.0 and remains available on GitHub at github.com/xai-org/grok-1, and Grok 2 was released under a community license permitting commercial use. Starting with the Grok 3 series, xAI moved to closed weights for its flagship lineup. API access requires an xAI API key via api.x.ai. Third-party gateway access is available through OpenRouter (confirmed Grok 4.20 beta listing) and Fireworks AI. As of June 2026, Grok 4.20 has not been confirmed on AWS Bedrock or Azure AI Catalog; those managed cloud launches applied to Grok 4.3. Commercial use is permitted under xAI's API terms of service.

Question 6

What modalities does Grok 4.20 support?

Accepted Answer

Grok 4.20 accepts text and image inputs natively, with PDF processing available through the X platform file pipeline. On the output side, it produces text and tool-call responses. Native function calling and JSON-structured output are supported via the xAI API with full parameter-level schema definitions, enabling integration with external tools, databases, and code execution environments. The four-agent standard variant runs parallel agents at inference time; Heavy mode scales this to 16 agents for complex decomposable tasks. Real-time X platform data integration is available via a live search hook, extending knowledge beyond the November 2024 static training cutoff for current-events queries. Audio input and output are not supported in Grok 4.20; xAI has stated natively multimodal audio and video capabilities are planned for Grok 5, which is on the public roadmap. Video input is similarly absent from the current release.

Question 7

Does Grok 4.20 train on user data?

Accepted Answer

xAI does not train on paid API inputs by default; a zero-retention enterprise option is available on request. Default API data retention is 30 days for safety monitoring and abuse detection, then deleted unless flagged. Users accessing Grok via the X platform or grok.com are subject to X's data retention policy, which is separate from API terms and may include use for model improvement unless opted out via account settings. SOC 2 Type II certification for the Grok 4.20 API tier had not been independently confirmed as of June 2026; xAI has referenced SOC 2 pursuit in enterprise communications without a confirmed attestation date. HIPAA-eligible processing and formal GDPR documentation are discussed for enterprise accounts; confirm specifics with xAI's enterprise sales team. The EU AI Act classifies Grok 4.20 under the general-purpose AI with systemic risk obligations category. Data residency for the API currently routes through US-based infrastructure; EU data residency options are not publicly confirmed.

Question 8

Who is Grok 4.20 best for and who should avoid it?

Accepted Answer

Grok 4.20 is best suited for teams processing documents above 200K tokens: legal, scientific, and financial research where the 2M context window eliminates chunking overhead. Its 88.9% GPQA Diamond score makes it appropriate for scientific reasoning pipelines including chemistry, biology, and physics question-answering at the graduate level. The 4-agent architecture with internal peer-review benefits long-form content generation and multi-step research tasks where hallucination reduction is critical. At $1.25/$2.50 per 1M tokens, it suits cost-sensitive teams migrating from Claude Opus 4.8 or GPT-5 at higher price points. Teams that should avoid it include those building real-time voice applications (no audio I/O, 15-second time-to-first-token on reasoning variants rules out conversational interfaces), organizations requiring on-premise or air-gapped deployment (closed weights only), and strict-content-filter enterprises where GPT-5 or Claude Opus 4.8's tighter default alignment is required. The Heavy variant's token multiplication also makes it a poor fit for high-volume short-task pipelines on a tight per-token budget.

Grok 4.20 Review: 2M Context, 88.9% GPQA, $1.25/M (2026)

About Grok 4.20

Pricing

Key Features

Pros

Cons

Benchmarks

Frequently Asked Questions