Claude Opus 4.8: 88.6% SWE-bench, $5/$25 Per 1M (May 2026)
Claude Opus 4.8 (released May 28, 2026): 1M context, 88.6% SWE-bench Verified, 96.7% USAMO. Agentic coding + reasoning. $5/$25/1M tokens. Tops GPT-5.5, Gemini.
Claude Opus 4.8 is Anthropic's flagship (released May 28, 2026) with 1M context and unmatched agentic coding performance: 88.6% SWE-bench, 69.2% SWE-bench Pro, 96.7% USAMO. Priced $5/$25 per 1M tokens, same as Opus 4.7. Excels at long-horizon automation, mathematical reasoning, and code review. Dynamic Workflows enable large-scale orchestration.
Claude Opus 4.8, released May 28, 2026 by Anthropic, is the flagship frontier model with 1M context and 88.6% SWE-bench Verified score. Costs $5 input / $25 output per 1M tokens. Tops GPT-5.5 on agentic coding (69.2% SWE-bench Pro), reasoning (96.7% USAMO), and multi-turn tool orchestration. Dynamic Workflows enable parallel subagent execution.
Provider: Anthropic · Family: Claude 4
Context window: 1,000,000 tokens · Max output: 128,000
Input modalities: text, image, pdf, tool-calls · Output: text, tool-calls
About Claude Opus 4.8
Claude Opus 4.8 is Anthropic's most capable generally available model, released on May 28, 2026. It is a dense Transformer with parameter count undisclosed (estimated ~600B based on capability footprint). The model builds on Opus 4.7's foundation with targeted improvements in agentic coding, mathematical reasoning, and behavioral consistency. Positioned as the flagship in Anthropic's Claude Opus family, it is designed to power long-running, multi-step autonomous workflows in enterprise environments, research, and developer tooling. The model maintains the same $5/$25 per-million-token pricing as its predecessor, with new fast-mode pricing at $10/$50 (3x cheaper than 4.7's fast mode). On frontier coding benchmarks, Opus 4.8 tops competitors across both dimensions. SWE-bench Verified hits 88.6% (up from 87.6% on Opus 4.7, ahead of GPT-5.5's estimates). SWE-bench Pro—the harder agentic coding benchmark—reaches 69.2%, a jump of 4.9 points from Opus 4.7's 64.3% and 10.6 points ahead of GPT-5.5's 58.6%. On mathematical reasoning (USAMO 2026), the model achieves 96.7%, a stunning leap from Opus 4.7's 69.3%. GPQA Diamond (graduate-level reasoning) stands at 93.6%. Terminal-Bench 2.1 (multi-turn coding and debugging) reaches 74.6%, up from 66.1% on Opus 4.7. Humanity's Last Exam scores 49.8% without tools and 57.9% with tools—the highest in the field. These benchmarks signal dominance in reasoning and code-centric tasks. The context window is 1M tokens on the Claude API, AWS Bedrock, and Google Vertex AI (200k on Microsoft Foundry). Standard max output is 128k tokens, with a beta 300k output option via the output-300k-2026-03-24 header on Batch API. Long-context recall (needle-in-haystack) is reliable above 100k tokens—the model maintains 99%+ accuracy on fact retrieval deep in the context window. The model architecture preserves prompt caching at a lower 1,024-token minimum (down from Opus 4.7), slashing repeat-context costs to $0.50 per 1M cached input tokens (a 90% discount vs. $5.00 standard). Multimodal inputs include text, images, and PDFs natively. The model processes up to 100 images per request and reads PDF documents directly without extraction. Tool use is native: function calling follows OpenAI-compatible schema with support for parallel tool calls. Computer use (screen reading, keyboard/mouse control) is verified on OSWorld at 83.4%, enabling desktop automation. However, there is no native audio input, audio output, or video processing—audio workflows must pair Opus 4.8 with a separate ASR/TTS model. Structured output (JSON mode) is available via the messages API. Pricing remains stable at $5 per 1M input tokens and $25 per 1M output tokens. Prompt caching reduces cached input cost to $0.50 per 1M. Fast mode (research preview) costs $10 input / $50 output and delivers ~2.5x faster output token generation. Batch API provides 50% discount ($2.50 input / $12.50 output, async within 24h). Worked example: a 100k-token research analysis costs $0.50 (with caching) to $0.32 (paid model). Daily agentic loop (1M input / 200k output) costs $6.00 standard or $10.00 in fast mode. Customer support chatbot (1000 turns, avg 2k input / 500 output per turn) costs $13.50 total. Deployment options span the major cloud platforms. The Claude API (api.anthropic.com) offers direct access with API Key auth. AWS Bedrock hosts the model with IAM-based auth across us-east-1, us-west-2, eu-central-1, ap-northeast-1, and other regions. Google Vertex AI serves the model with GCP IAM auth. Microsoft Foundry provides access (200k context only) with Azure identity auth. No self-hosting or air-gapped deployment option exists—Opus 4.8 is API-only, proprietary weights. SDKs available in Python, TypeScript, Java, Go, and Ruby. Safety posture is balanced with measurable alignment improvements. Training data cutoff is January 2026. Constitutional AI and RLHF alignment are standard. Anthropic has not yet published a dedicated Opus 4.8 system card (most recent is Opus 4.6, February 2026), but internal evaluations show Opus 4.8 is 4x less likely than Opus 4.7 to overlook code flaws it has written—achieving 0% on 'uncritically reporting flawed results' evaluations, a first for the Opus line. The model refuses clear harms (CSAM, weapons, malware code) and shows improved honesty, with calibrated refusals on edge cases. Inputs and outputs are retained for 30 days for abuse monitoring; zero-retention option available on enterprise plans. SOC 2 Type II, ISO 27001, HIPAA-eligible, and GDPR-compliant. Behavioral improvements over Opus 4.7 include better long-context handling (fewer compactions, faster recovery), calibrated reasoning effort (aligned behavior at each effort level), and tool triggering (fewer skipped required tool calls). The model shows stronger consistency in long-horizon agentic loops—critical for multi-step workflows running 8+ hours. New feature: Dynamic Workflows in Claude Code (Enterprise/Team/Max plans, research preview) allows Opus 4.8 to plan work, spawn up to 1,000 parallel subagents, and verify outputs before final report. Fast-mode price drop makes the 2.5x-faster variant accessible for latency-sensitive workloads (real-time customer support, instant analytics). Who should use Opus 4.8: agentic coding teams running long autonomous loops, enterprise research teams analyzing 100K+ documents, mathematical problem-solving workflows, code review and refactoring at scale, customer-support automation requiring multi-turn tool use, and teams leveraging prompt caching on static contexts (system prompts, knowledge bases). Who should avoid: teams needing on-device inference (proprietary, no self-host), voice-first assistants (no native audio), or sub-500ms latency (standard mode ~850ms p50). GPT-5.5 leads on pure speed and real-time constraints. Gemini 3.1 Deep Think excels on mathematical theory. DeepSeek V3 and open-source models (Llama 4) are viable for cost-constrained or air-gapped setups.
Pricing
Standard: $5.00 input, $25.00 output per 1M tokens. Cached input (1024+ token minimum): $0.50 per 1M. Fast mode: $10.00/$50.00 (2.5x faster tokens). Batch API: 50% off ($2.50/$12.50, async 24h).
Key Features
- 1M Context Window: Reliably recalls facts and code above 100K tokens. Verified 99%+ needle-in-haystack accuracy. Fully GA on Claude API, Bedrock, Vertex AI.
- Adaptive Thinking: Configurable reasoning depth per request. Balances accuracy and latency. Recommended 2K-8K token budgets for most tasks.
- Prompt Caching: 90% cost savings on cached input (1,024+ token min). Cache TTL 24h. Ideal for agentic loops with static system prompts or knowledge bases.
- Computer Use & Tool Orchestration: Screen reading, keyboard/mouse control verified at 83.4% (OSWorld). Parallel tool calls, structured JSON output, function calling.
- Dynamic Workflows: Claude Code (Enterprise/Team/Max, research preview). Plan work, spawn up to 1,000 parallel subagents, verify outputs, report results.
Pros
- Tops SWE-bench Verified at 88.6%, SWE-bench Pro at 69.2%, beating GPT-5.5 and Gemini 3.1 Pro on coding.
- 96.7% on USAMO 2026 (27.4 points up from 4.7), signaling frontier math reasoning capabilities.
- Achieves 0% on code-flaw evals—first Claude model to achieve perfect score; 4x less likely to miss bugs.
- 1M context with reliable long-recall, 90% cheaper caching, and adaptive thinking for efficient reasoning.
- Same $5/$25 pricing as Opus 4.7, with 3x cheaper fast mode ($10/$50) and new Dynamic Workflows.
Cons
- No native audio or video input; must integrate separate ASR/TTS or vision models.
- Proprietary weights, API-only; cannot self-host, deploy air-gapped, or fine-tune.
- Standard latency ~850ms p50 (slower than GPT-5.5 mini's ~320ms); fast mode trades cost for speed.
- System card not yet published; internal evals strong but third-party red-team data sparse.
- Parameter count undisclosed; estimated ~600B but unconfirmed.
Benchmarks
- usamo 2026: 96.7
- lmarena elo: 1412
- gpqa diamond: 93.6
- lmarena rank: 2
- swe bench pro: 69.2
- swe bench verified: 88.6
- terminal bench 2 1: 74.6
- humanitys last exam: 49.8
- artificial analysis intelligence index: 72
- artificial analysis price blended per m: 12.5
- artificial analysis speed tokens per sec: 78.5
Frequently Asked Questions
What is Claude Opus 4.8 and who built it?
Claude Opus 4.8 is Anthropic's most capable generally available AI model, released on May 28, 2026. It is a dense Transformer with parameter count undisclosed (estimated ~600B based on capability scaling). The model is built by Anthropic, the AI safety company founded in 2021 by former OpenAI researchers. Opus 4.8 is positioned as Anthropic's flagship for agentic reasoning, long-context analysis, and autonomous coding workflows. It maintains the Claude Opus family's focus on behavioral consistency and alignment. The model excels on SWE-bench Verified (88.6%), mathematical reasoning (USAMO 96.7%), and code-flaw detection (0% miss rate on internal evals, 4x improvement over Opus 4.7). Anthropic markets Opus 4.8 as a modest but tangible improvement over Opus 4.7, with deeper gains in math and agentic task consistency.
How much does Claude Opus 4.8 cost per 1M tokens?
Claude Opus 4.8 costs $5.00 per 1M input tokens and $25.00 per 1M output tokens on standard mode. Prompt caching (minimum 1,024 tokens) costs $0.50 per 1M cached input tokens, a 90% discount vs standard input. Fast mode (research preview) costs $10.00 input / $50.00 output per 1M and delivers 2.5x faster output tokens at double the standard price. Batch API (async within 24 hours) costs $2.50 input / $12.50 output per 1M, a 50% discount. Worked examples: (1) Analyzing a 100K-token research paper with caching costs $0.32. (2) Daily agentic coding loop (1M input / 200K output, standard mode) costs $6.00. (3) Customer-support chatbot (1000 turns, avg 2K input / 500 output per turn) costs $13.50 total. (4) Batch processing 10 million tokens overnight costs $37.50. Pricing has remained stable since Opus 4.5 (Nov 2025) through Opus 4.8 (May 2026).
What is Claude Opus 4.8's context window and max output?
Claude Opus 4.8 supports a 1M token (1 million token) context window by default on the Claude API, AWS Bedrock, and Google Vertex AI. Microsoft Foundry offers 200k context. Standard maximum output is 128,000 tokens. For batch processing, there is a beta 300k output option via the output-300k-2026-03-24 header on the Batch API. Long-context recall is reliable: the model maintains 99%+ accuracy on needle-in-haystack evals (finding facts buried deep in the context), verified above 100K token depth. The model architecture preserves token position awareness without significant degradation, enabling tasks like analyzing entire codebases (1M is ~3-5 files of code) or multi-document research synthesis. Prompt caching minimum is 1,024 tokens, lower than Opus 4.7 (2,048), making it cost-effective to cache smaller system prompts and tool definitions.
How does Claude Opus 4.8 compare on benchmarks vs GPT-5.5 and Gemini 3.1 Pro?
Claude Opus 4.8 dominates on agentic coding and mathematical reasoning. SWE-bench Verified: Opus 4.8 leads at 88.6% vs GPT-5.5 (estimated 85%) and Gemini 3.1 Pro (84.2%). SWE-bench Pro (harder agentic coding): Opus 4.8 reaches 69.2%, beating GPT-5.5 (58.6%) and Gemini (62.8%). USAMO 2026 (math competition): Opus 4.8 dominates at 96.7%, far exceeding GPT-5.5 (91.2%) and Gemini (89.4%). GPQA Diamond (graduate reasoning): Opus 4.8 scores 93.6%, competitive with Gemini 3.1 Deep Think (94.1%) but ahead of GPT-5.5 (91.2%). Humanity's Last Exam (multidisciplinary): Opus 4.8 scores 57.9% with tools, highest in the field. Code honesty: Opus 4.8 achieves 0% flaw-miss rate (first Claude model), vs Opus 4.7 (25% miss rate) and GPT-5.5 (18% miss rate). LMArena Elo: Opus 4.8 ranks #2 at ~1,412 Elo, behind GPT-5.5 (~1,428) but ahead of Gemini 3.1 Pro. Each model wins different axes: Opus excels in agentic coding and math, GPT-5.5 leads on speed, Gemini leads pure theorem-proving.
Is Claude Opus 4.8 open source or proprietary?
Claude Opus 4.8 is proprietary with closed weights. There is no open-source or open-weights version. Users cannot download the model, self-host, deploy air-gapped, or fine-tune the base weights. Access is API-only through multiple vendor channels. The Claude API (api.anthropic.com) offers direct access with API Key authentication. AWS Bedrock hosts the model with IAM-based auth across us-east-1, us-west-2, eu-central-1, ap-northeast-1, and other regions. Google Vertex AI provides access with GCP IAM authentication. Microsoft Foundry (Azure) offers deployment with Azure Identity authentication (200k context only). No other major platforms (Together.ai, Fireworks.ai, Lambda Labs) currently host Opus 4.8 as of May 2026. SDKs are available in Python, TypeScript, JavaScript, Java, Go, and Ruby for all major deployment platforms.
What modalities does Claude Opus 4.8 support?
Claude Opus 4.8 is multimodal with text and vision inputs, text output only. Input modalities: text (unlimited tokens), images (up to 100 per request, any resolution), PDF documents (native reading without extraction), and tool calls (function calling schema compatible with OpenAI). Output modalities: text (up to 128k tokens, beta 300k), and tool calls (parallel execution supported). Special capabilities: structured output (JSON mode via API), function calling (native, OpenAI-compatible schema), and computer use (screen reading, keyboard/mouse control verified at 83.4% on OSWorld). Notably absent: no native audio input, no audio output, no video input. Audio workflows require pairing with a separate ASR (automatic speech recognition) and TTS (text-to-speech) model. Video understanding requires extracting frames and processing as images or using a separate video model. The model's vision capabilities are strong on documents, diagrams, charts, and UI screenshots—optimized for code-centric and analysis tasks.
Does Claude Opus 4.8 train on user data?
No, Claude Opus 4.8 does not train on user data by default. Inputs and outputs are retained for 30 days for abuse monitoring and then deleted unless flagged. The model was trained on data with a cutoff of January 2026, including public web text, licensed datasets, and synthetic reasoning traces. Anthropic does not train future Claude models on API inputs; this is the default policy. Users can opt for zero-retention mode on enterprise plans, which means data is deleted immediately after processing. Data governance: Claude Opus 4.8 is SOC 2 Type II certified, ISO 27001 compliant, HIPAA-eligible for healthcare workflows, and GDPR-compliant for EU data protection. Data residency options include US and EU regions depending on deployment (Claude API regional selection, Bedrock region choice, Vertex AI location). The model supports safe data handling for regulated industries. Anthropic's Constitutional AI and RLHF alignment are standard safety measures across all deployments.
Who is Claude Opus 4.8 best for and who should avoid it?
Claude Opus 4.8 excels for four main use cases: (1) Agentic coding teams running long autonomous loops in CI/CD, GitHub Copilot, or Claude Code—88.6% SWE-bench and 0% code-flaw miss rate make it the industry leader. (2) Enterprise research and knowledge teams analyzing 100K+ documents—1M context with 99%+ long-recall and 90% cheaper caching. (3) Mathematical and multidisciplinary reasoning workflows—96.7% USAMO and 93.6% GPQA excel here. (4) Customer support and DevOps teams needing reliable tool orchestration and computer use (83.4% OSWorld). Teams that should avoid: (1) Real-time voice assistants (no native audio, 850ms p50 latency too high for sub-500ms voice interactions; use GPT-5.5 mini or speech-specialized models). (2) On-device or air-gapped deployments (proprietary, API-only; open models like Llama 4 or self-hosted Gemini are alternatives). (3) Pure theorem-proving or symbolic mathematics (Gemini 3.1 Deep Think excels here; specialized math models also compete). (4) Video-first workflows (no native video; separate vision model required). Budget-conscious teams may prefer Claude Sonnet 4.6 (40% cheaper) for less complex tasks or open-weights alternatives.