GPT-5.5 Review: 88.7% SWE-bench, $5/M Tokens (2026)
GPT-5.5 (codename Spud) is OpenAI's flagship multimodal model, released April 23, 2026, and the first OpenAI model to process text, image, audio, and video in a single architecture. It scores 88.7% on SWE-bench Verified and 92.4% on MMLU with a 1M-token context window. API pricing is $5 input / $30 output per 1M tokens; the GPT-5.5 Pro tier costs $30 input / $180 output per 1M tokens. It replaced GPT-5.4 as OpenAI's flagship and became ChatGPT's default on May 5, 2026.
Provider: OpenAI · Family: GPT-5
Context window: 1,000,000 tokens · Max output: 128,000
Input modalities: text, image, audio, video, tool-calls · Output: text, tool-calls
About GPT-5.5
GPT-5.5, codename Spud, is OpenAI's sixth-generation flagship multimodal language model, released April 23, 2026, with API access opening April 24. It is the first OpenAI model to process text, images, audio, and video in a single native architecture, removing the prior requirement to stitch GPT-5, Whisper, and Sora behind an agent. The model sits above GPT-5.4 in OpenAI's lineup and targets complex professional work: agentic coding, long-document analysis, multimodal research, and cross-tool automation. OpenAI has not disclosed the parameter count; the architecture follows the Mixture-of-Experts lineage used in GPT-5, with 128-expert routing and sparse activation per token.
GPT-5.5 scores 88.7% on SWE-bench Verified, the agentic software engineering benchmark that measures bug-fix success on real-world codebases. On MMLU it reaches 92.4%. For visual reasoning, it posts 92.1% on ChartQA and 96.2% on AI2D (diagram understanding). Terminal-Bench, which measures agentic task completion in a real terminal environment, comes in at 82.7%. BenchLM ranks GPT-5.5 fourth globally among 119 models as of June 2026, with an overall score of 91/100. GPT-5.5 Instant (the lighter default variant) scored 81.2% on AIME 2025, compared to 65.4% for GPT-5.3 Instant. GPQA Diamond and AIME 2025 scores for the full GPT-5.5 model have not been independently published. Against Claude Opus 4.8 (88.6% SWE-bench Verified), GPT-5.5 is effectively tied on agentic coding, with a gap of 0.1 percentage points.
The API context window is 1,000,000 tokens with a maximum output of 128,000 tokens per completion. The Codex product caps GPT-5.5 at 400,000 tokens. A context surcharge applies when prompts exceed 272,000 tokens: the full session is billed at 2x input and 1.5x output. No needle-in-haystack evaluation at 1M tokens has been published for GPT-5.5 as of June 2026.
GPT-5.5 natively accepts text, images, audio, and video in a single model session, with text and tool-calls as outputs.
Computer use is live via Codex's hosted shell and apply-patch tooling. Function calling, structured outputs, MCP server integration, OpenAI Skills, web search, and a hosted shell are all available in the API. Batch API and Flex pricing tiers are live. Canvas was removed in a May 2026 update; writing and code features now surface through native response blocks.
Standard API pricing is $5.00 per 1M input tokens and $30.00 per 1M output tokens, with cached input at $0.50 per 1M (90% off the full input rate). The GPT-5.5 Pro tier costs $30.00 input / $180.00 output per 1M tokens. Batch API provides 50% off at $2.50 input / $15.00 output, with results within 24 hours. Worked cost examples: a 100K-token coding task with 10K output costs about $0.80 ($0.50 input + $0.30 output); a 1M-token research session crossing the surcharge threshold costs about $10 in input alone (1M tokens at 2x the $5 rate); a daily coding agent running 1M tokens in and 200K out costs about $11 at standard rates ($5 input + $6 output).
GPT-5.5 is available via the OpenAI API directly. AWS Bedrock added GPT-5.5 on June 1, 2026, following a $50 billion partnership that ended OpenAI's prior exclusivity with Microsoft Azure. Microsoft retains a non-exclusive license to OpenAI's IP through 2032, and Azure remains an active deployment target. Authentication is via API key for the direct endpoint and AWS IAM on Bedrock. Together AI and Fireworks availability had not been confirmed as of June 9, 2026.
The GPT-5.5 System Card was published April 24, 2026, at deploymentsafety.openai.com/gpt-5-5. OpenAI reports that GPT-5.5 Instant produces 52.5% fewer hallucinated claims on high-stakes prompts (medicine, law, finance) compared to GPT-5.3 Instant. Safety training uses reinforcement learning, with the model producing a long internal chain of thought before output, plus safety classifiers that filter harmful and personal content. OpenAI's Preparedness Framework was applied during evaluation. Specific HarmBench refusal rates have not been published for GPT-5.5.
GPT-5.5 is the right choice for teams building multimodal agentic workflows where native audio, video, and image processing matter: a single model call replaces separate ASR and vision pipelines. It is strong for complex professional research and long-document analysis at 1M tokens. Teams running text-only, cost-sensitive workloads should consider Gemini 2.5 Flash at $0.15/$0.60 per 1M, which is roughly 30-50x cheaper for pure text (about 33x on input, 50x on output). Voice-first applications needing sub-200ms latency should use OpenAI's Realtime API rather than GPT-5.5's standard endpoint, where TTFT runs 400-700ms. On agentic coding benchmarks, Claude Opus 4.8 posts comparable SWE-bench Verified scores and should be evaluated in parallel.
Training data includes publicly available internet text, licensed third-party datasets, and human trainer inputs. OpenAI does not train on API inputs by default; enterprise zero-retention is available. The exact training cutoff for GPT-5.5 has not been published; for reference, GPT-5.2's cutoff was August 31, 2025. SOC 2 Type II certification applies. API inputs sent via AWS Bedrock are governed by AWS data terms, separately from OpenAI's direct API policy.
GPT-5.5 Instant replaced GPT-5.3 Instant as ChatGPT's default model on May 5, 2026. A May 2026 style update reduced overly long and bullet-heavy responses in the Instant variant. Canvas was deprecated in the same window, with writing and code blocks replacing it natively. GPT-5.5 Pro launched alongside the standard model, targeting the heaviest professional workloads. As of June 2026, community sources reference GPT-5.6 with a rumored 1.5M-token context window.
Pricing
$5 per 1M input, $30 per 1M output. Cached input $0.50 per 1M (90% off). Prompts over 272K tokens billed at 2x input / 1.5x output for the full session. Batch API: $2.50 input / $15 output. GPT-5.5 Pro: $30 input / $180 output per 1M.
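The standard rates and the long-context surcharge can be sketched as a small estimator. This is a minimal sketch that uses only the figures quoted in this section; the function name `estimate_cost` is hypothetical, and it assumes a single-prompt session in which the surcharge is triggered by the prompt size alone. Cached input and the Batch tier are left out.

```python
# Rates quoted in this review (USD per 1M tokens).
INPUT_PER_M = 5.00
OUTPUT_PER_M = 30.00

# Long-context surcharge as described in this review.
SURCHARGE_THRESHOLD = 272_000   # prompt tokens
SURCHARGE_INPUT_MULT = 2.0
SURCHARGE_OUTPUT_MULT = 1.5


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one session at standard rates.

    Assumption: the surcharge applies to the whole session as soon as
    the prompt exceeds the threshold.
    """
    in_mult, out_mult = 1.0, 1.0
    if input_tokens > SURCHARGE_THRESHOLD:
        in_mult, out_mult = SURCHARGE_INPUT_MULT, SURCHARGE_OUTPUT_MULT
    input_cost = input_tokens / 1_000_000 * INPUT_PER_M * in_mult
    output_cost = output_tokens / 1_000_000 * OUTPUT_PER_M * out_mult
    return round(input_cost + output_cost, 4)


if __name__ == "__main__":
    print(estimate_cost(100_000, 10_000))   # below the threshold
    print(estimate_cost(1_000_000, 0))      # above the threshold, input only
```

Under these assumptions, a 100K-token prompt with 10K output comes to $0.80, and a 1M-token prompt costs $10.00 in input once the 2x multiplier applies.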
Key Features
- Unified Multimodal Architecture: Native text, image, audio, and video processing in one model session. Removes the need to stitch GPT-5, Whisper, and Sora behind an agent.
- 1M-Token Context Window: Largest context in the GPT-5 family. Handles entire codebases or book-length documents in a single API call (API only; Codex caps at 400K).
- Agentic Computer Use: Built-in computer use via Codex hosted shell and apply-patch. Handles multi-step software tasks end-to-end with 88.7% SWE-bench Verified.
- Native MCP and Skills Integration: MCP server integration and OpenAI Skills allow tool-augmented agent loops without custom orchestration code.
- Prompt Caching at 90% Off: Cached input tokens cost $0.50 per 1M (vs $5.00 standard), delivering major cost savings on repeat-context workloads like long system prompts.
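To illustrate the caching feature above, the following sketch compares the input cost of an agent loop with and without cached input. It is illustrative only: the function name `loop_input_cost` and the example workload are hypothetical, and it assumes the first call pays the full rate for the shared prefix while every later call is billed at the cached rate for that prefix. Actual cache-hit behavior may differ.

```python
# Rates quoted in this review (USD per 1M tokens).
INPUT_PER_M = 5.00
CACHED_INPUT_PER_M = 0.50


def loop_input_cost(prefix_tokens: int, fresh_tokens_per_call: int,
                    calls: int, use_cache: bool) -> float:
    """Input cost (USD) of `calls` requests that share a fixed prefix."""
    if calls < 1:
        raise ValueError("calls must be at least 1")
    if not use_cache:
        total = (prefix_tokens + fresh_tokens_per_call) * calls
        return total / 1_000_000 * INPUT_PER_M
    # Assumption: the prefix is billed at the full rate once, then at the
    # cached rate on each of the remaining calls.
    full_rate_tokens = prefix_tokens + fresh_tokens_per_call * calls
    cached_tokens = prefix_tokens * (calls - 1)
    return (full_rate_tokens / 1_000_000 * INPUT_PER_M
            + cached_tokens / 1_000_000 * CACHED_INPUT_PER_M)


if __name__ == "__main__":
    # Hypothetical workload: 50K-token system prompt, 2K fresh tokens, 100 calls.
    without = loop_input_cost(50_000, 2_000, 100, use_cache=False)
    with_cache = loop_input_cost(50_000, 2_000, 100, use_cache=True)
    print(f"without cache: ${without:.3f}")
    print(f"with cache:    ${with_cache:.3f}")
```

For this hypothetical workload the input bill drops from $26.00 to about $3.73. The saving is below the headline 90% because the fresh tokens on each call are still billed at the full rate.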
Pros
- 88.7% SWE-bench Verified, among the top agentic coding models available via API.
- First OpenAI model with native audio and video input, eliminating separate pipeline integrations.
- Prompt caching at $0.50 per 1M cached input tokens cuts costs by 90% on repeat-context agent loops.
Cons
- Context surcharge past 272K tokens doubles the input rate and raises the output rate to 1.5x for the full session.
- No native audio output: speech synthesis requires a separate model or the Realtime API.
- At $5/$30 per 1M tokens, it is 30-50x more expensive than Gemini 2.5 Flash for text-only workloads.
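The 30-50x range in the last point follows from the two price lists quoted in this review. A short sketch, with the hypothetical helper `price_ratio`, shows how the multiple depends on the input/output mix:

```python
# Prices quoted in this review (USD per 1M tokens): (input, output).
GPT_5_5 = (5.00, 30.00)
GEMINI_2_5_FLASH = (0.15, 0.60)


def price_ratio(input_tokens: int, output_tokens: int) -> float:
    """How many times more GPT-5.5 costs than Gemini 2.5 Flash for a workload."""
    if input_tokens == 0 and output_tokens == 0:
        raise ValueError("workload must contain at least one token")
    gpt = input_tokens * GPT_5_5[0] + output_tokens * GPT_5_5[1]
    flash = input_tokens * GEMINI_2_5_FLASH[0] + output_tokens * GEMINI_2_5_FLASH[1]
    return gpt / flash


if __name__ == "__main__":
    print(round(price_ratio(1, 0), 1))                  # input-only workload
    print(round(price_ratio(0, 1), 1))                  # output-only workload
    print(round(price_ratio(1_000_000, 200_000), 1))    # mixed workload
```

An input-only workload sits at about 33x, an output-only workload at 50x, and a mix of 1M tokens in and 200K out lands near 41x.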
Benchmarks
- MMLU: 92.4%
- SWE-bench Verified: 88.7%
- Output speed (Artificial Analysis): 60.7 tokens/sec
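The throughput figure can be turned into a rough wall-clock estimate for long completions. This is a back-of-the-envelope sketch: the helper `generation_seconds` is hypothetical, and it assumes the 60.7 tokens/sec rate stays constant for the whole response, which real deployments will not guarantee.

```python
# Output speed listed in this review (tokens per second).
OUTPUT_TOKENS_PER_SEC = 60.7


def generation_seconds(output_tokens: int, ttft_seconds: float = 0.0) -> float:
    """Rough time to stream a completion: time to first token plus decode time."""
    return ttft_seconds + output_tokens / OUTPUT_TOKENS_PER_SEC


if __name__ == "__main__":
    # A 2K-token answer, using the upper end of the 400-700ms TTFT range
    # that this review quotes for the standard endpoint.
    print(f"{generation_seconds(2_000, ttft_seconds=0.7):.1f} s")
    # A completion at the 128K max-output limit, in minutes.
    print(f"{generation_seconds(128_000) / 60:.1f} min")
```

At this rate, a completion that fills the 128,000-token output limit would take roughly 35 minutes to stream.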
Frequently Asked Questions
What is GPT-5.5 and who built it?
GPT-5.5, codename Spud, is a large multimodal language model developed by OpenAI and released on April 23, 2026, with API access opening April 24. It is built on a Mixture-of-Experts Transformer architecture and is the first OpenAI model to process text, images, audio, and video in a single native architecture rather than stitching together separate models. GPT-5.5 sits above GPT-5.4 in OpenAI's lineup as the flagship system for complex professional work. It scores 88.7% on SWE-bench Verified and 92.4% on MMLU, placing it fourth globally among 119 models on BenchLM as of June 2026. The model supports a 1M-token context window in the API and includes built-in computer use via Codex. It replaced GPT-5.4 as OpenAI's primary frontier model and GPT-5.5 Instant replaced GPT-5.3 Instant as ChatGPT's default on May 5, 2026. Standard API pricing is $5 input / $30 output per 1M tokens.
How much does GPT-5.5 cost per 1M tokens?
Standard API pricing for GPT-5.5 is $5.00 per 1M input tokens and $30.00 per 1M output tokens, with cached input at $0.50 per 1M (a 90% reduction from the full input rate). The Batch API tier provides 50% off at $2.50 input / $15.00 output per 1M, with results delivered asynchronously within 24 hours. The GPT-5.5 Pro tier costs $30.00 input / $180.00 output per 1M and is available only on OpenAI's Pro, Business, and Enterprise ChatGPT plans. A context surcharge activates when prompts exceed 272,000 tokens: the full session is billed at 2x input and 1.5x output. Worked cost examples: a 100K-token coding task with 10K output costs about $0.80; a daily coding agent running 1M tokens in and 200K out costs about $11 at standard rates; a 1M-token research session crossing the surcharge threshold costs about $10 in input alone (1M tokens at 2x the $5 rate). For comparison, Gemini 2.5 Flash costs $0.15/$0.60 per 1M for text-only tasks, making GPT-5.5 around 30-50x more expensive on pure-text workloads. GPT-5.5 is not available for self-hosting, so infrastructure costs do not apply.
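The tiers described in this answer can be compared side by side for one workload. The sketch below is illustrative: `cost_by_tier` is a hypothetical helper, and it ignores the long-context surcharge and cached input, so it fits best when the tokens are spread across many requests that each stay under 272,000 tokens.

```python
# Per-1M-token rates quoted in this review: (input, output) in USD.
TIERS = {
    "standard": (5.00, 30.00),
    "batch": (2.50, 15.00),
    "pro": (30.00, 180.00),
}


def cost_by_tier(input_tokens: int, output_tokens: int) -> dict[str, float]:
    """USD cost of the same workload on each tier (no surcharge, no caching)."""
    return {
        name: round(input_tokens / 1_000_000 * in_rate
                    + output_tokens / 1_000_000 * out_rate, 4)
        for name, (in_rate, out_rate) in TIERS.items()
    }


if __name__ == "__main__":
    # The daily coding agent from the worked examples: 1M tokens in, 200K out.
    for tier, cost in cost_by_tier(1_000_000, 200_000).items():
        print(f"{tier:>8}: ${cost:.2f}")
```

For the daily-agent example this gives $11.00 on the standard tier, $5.50 through the Batch API, and $66.00 on GPT-5.5 Pro.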
What is GPT-5.5's context window and max output?
GPT-5.5 supports a 1,000,000-token (1M) context window in the API, making it the largest in the GPT-5 family. Maximum output per completion is 128,000 tokens. In the Codex product, the context window is capped at 400,000 tokens; teams requiring full 1M-token sessions must use the API directly via the Responses API endpoint. A context surcharge activates when prompts exceed 272,000 tokens per session: the entire session is then billed at 2x input and 1.5x output pricing. No publicly verified needle-in-haystack evaluation at 1M tokens has been published for GPT-5.5 as of June 2026. For comparison, Claude Opus 4.8 offers 1M context and Gemini 2.5 Pro offers 2M context. For most coding and document tasks under 200K tokens, the practical difference between GPT-5.5 and competitors at the context-window level is minimal; the surcharge matters most for RAG workloads passing full document corpora.
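The limits in this answer can be expressed as a pre-flight check before a request is sent. This is a sketch under stated assumptions: `check_request` is a hypothetical helper, token counts are supplied by the caller, and because this review does not say whether output tokens count against the context window, the check compares the prompt alone against each window.

```python
# Limits quoted in this review.
API_CONTEXT_TOKENS = 1_000_000
CODEX_CONTEXT_TOKENS = 400_000
MAX_OUTPUT_TOKENS = 128_000
SURCHARGE_THRESHOLD = 272_000


def check_request(prompt_tokens: int, max_output_tokens: int) -> dict[str, bool]:
    """Classify a planned request against the documented limits."""
    return {
        "fits_api": prompt_tokens <= API_CONTEXT_TOKENS,
        "fits_codex": prompt_tokens <= CODEX_CONTEXT_TOKENS,
        "output_within_cap": max_output_tokens <= MAX_OUTPUT_TOKENS,
        "surcharged": prompt_tokens > SURCHARGE_THRESHOLD,
    }


if __name__ == "__main__":
    # A 300K-token prompt fits both products but crosses the surcharge line.
    print(check_request(300_000, 10_000))
```

A prompt between 272,001 and 400,000 tokens is the case worth watching: it still fits in Codex, but on the API it would already be billed at the surcharge rates.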
How does GPT-5.5 compare on benchmarks vs Claude Opus 4.8?
GPT-5.5 scores 88.7% on SWE-bench Verified; Claude Opus 4.8 scores 88.6%, leaving the two models effectively tied on agentic coding. On MMLU, GPT-5.5 posts 92.4%, while Claude Opus 4.8's MMLU-Pro score sits around 81.2%; MMLU and MMLU-Pro are different benchmarks, so the two figures are not directly comparable. For visual reasoning, GPT-5.5 posts 92.1% on ChartQA and 96.2% on AI2D, benchmarks for which no head-to-head Claude scores have been published. Claude Opus 4.8 tops the SWE-bench Pro variant (which measures harder, longer agentic sessions) at 69.2%; GPT-5.5's score on that variant has not been published. On AIME 2025 advanced math, Claude Opus 4.8 posts high scores in the extended-thinking regime; GPT-5.5's full-model AIME score has not been published (the Instant variant scored 81.2%). BenchLM places GPT-5.5 at rank 4 and Claude Opus 4.8 at or near the top position. In practice, the choice between them depends more on cost structure, deployment platform, and multimodal requirements than on benchmark deltas of under 1 percentage point.
Is GPT-5.5 open source or proprietary?
GPT-5.5 is fully proprietary: the model weights are closed and it is API-only. There is no HuggingFace release, no VRAM requirement to consider, and no self-hosting path. Access is via the OpenAI API (api.openai.com), AWS Bedrock (available since June 1, 2026), and Azure OpenAI Service (non-exclusive, active through 2032). OpenAI did release open-weight models in August 2025 (the GPT-OSS family: gpt-oss-120b and gpt-oss-20b, Apache 2.0), but these are distinct from GPT-5.5 and do not share its architecture or capability level. GPT-5.5 has no commercial self-hosting option, no fine-tuning of the base weights, and no air-gapped deployment path. Enterprise customers can access it with zero-retention data handling via API agreements. For teams requiring on-premises deployment or open weights, Llama 4 or Mistral Medium 3 are alternatives.
What modalities does GPT-5.5 support?
GPT-5.5 accepts text, images, audio, and video as inputs in a single native model session, making it the first OpenAI model to handle all four modalities without routing to separate specialist models. Output modalities are text and tool-calls only: audio output is not available from GPT-5.5 directly. Function calling, structured outputs, MCP server integration, OpenAI Skills, and web search are all live in the API. Computer use is handled via Codex's hosted shell and apply-patch tooling. PDF inputs are supported via the image pipeline. Video processing enables structured summaries with timestamps and action items from meeting recordings and webinars. Audio supports multi-language transcription and translation. For applications requiring audio output (voice assistants, spoken responses), GPT-5.5 must be paired with a TTS model or the OpenAI Realtime API, which handles bidirectional audio at lower latency.
Does GPT-5.5 train on user data?
OpenAI does not train on API inputs by default. Users and organizations can opt out of data retention entirely through enterprise zero-retention agreements. API inputs are retained under OpenAI's standard data policy for a limited period for abuse monitoring unless a zero-retention agreement is active. For AWS Bedrock deployments, data governance follows AWS terms separately from OpenAI's direct API policy, with data processed within the selected AWS region. OpenAI holds SOC 2 Type II certification; HIPAA-eligible configurations are available for enterprise. GDPR compliance applies for EU users. The EU AI Act classifies GPT-5.5 as a general-purpose AI with systemic risk obligations, which requires OpenAI to maintain transparency reporting and adversarial testing documentation. Developers should review the OpenAI data usage policies at openai.com/policies and the specific Bedrock or Azure data terms for their deployment region.
Who is GPT-5.5 best for and who should avoid it?
GPT-5.5 is best for teams building multimodal agentic workflows: native audio, video, and image processing in one model removes the complexity of integrating separate ASR and vision systems. It is strong for complex professional research, long-document analysis (up to 1M tokens), and agentic coding tasks at 88.7% SWE-bench Verified. Organizations already on OpenAI's platform scaling from GPT-5.4 gain a clear upgrade path without SDK changes. Teams running text-only cost-sensitive workloads should avoid GPT-5.5: at $5/$30 per 1M, Gemini 2.5 Flash at $0.15/$0.60 is 30-50x cheaper for pure-text tasks. Voice-first applications needing sub-200ms latency should use the OpenAI Realtime API, not GPT-5.5's standard endpoint where TTFT runs 400-700ms. Air-gapped or on-premises deployments are impossible given the proprietary, API-only model. Teams with strict per-session cost caps should model the 272K-token surcharge before committing, since it can double input costs on document-heavy sessions.