Sakana Fugu: 95.5% GPQA, $5/M Tokens (2026 Review)
Sakana Fugu: multi-agent orchestration model by Sakana AI, launched June 2026. GPQA 95.5%, SWE-Bench Pro 73.7%. Fugu Ultra: $5/$30 per 1M tokens, 1M context.
Sakana Fugu is Sakana AI's orchestration model (released June 22, 2026) that routes queries across a swappable pool of frontier LLMs via a single OpenAI-compatible API, scoring 95.5% on GPQA Diamond and 73.7% on SWE-Bench Pro (both vendor-reported). Fugu Ultra is priced at $5 per million input tokens and $30 per million output tokens, with a 1,000,000-token context window for demanding multi-step tasks.
Sakana Fugu is an orchestration model released June 22, 2026 by Tokyo-based Sakana AI. Unlike a single dense model, it routes each query to the optimal combination of frontier LLMs using TRINITY and Conductor coordination frameworks. Fugu Ultra costs $5 per million input tokens and $30 per million output tokens. It scored 95.5% on GPQA Diamond and 73.7% on SWE-Bench Pro, both vendor-reported.
Provider: Sakana AI · Family: Sakana Fugu
Context window: 1,000,000 tokens · Max output: 128,000
Input modalities: text, tool-calls · Output: text, tool-calls
About Sakana Fugu
Sakana Fugu is a multi-agent orchestration model released on June 22, 2026 by Sakana AI, a Tokyo-based startup. Unlike a standard single-model LLM, Fugu is itself a language model trained to coordinate a pool of other frontier LLMs. When you send a request to the Fugu API, Fugu decides whether to answer directly or to break the problem into subtasks, delegate those subtasks to a team of specialist models, verify the outputs, and synthesize one final response. The product exposes two tiers: Fugu (optimized for lower latency on everyday tasks) and Fugu Ultra (tuned for maximum accuracy on hard, multi-step problems with a 1,000,000-token extended context window). The architecture is grounded in two research frameworks Sakana published at ICLR 2026. TRINITY uses a lightweight evolved coordinator that assigns roles to each model in the pool: Thinker (plans the approach), Worker (executes subtasks), and Verifier (checks outputs). The coordinator learns how to allocate a limited turn budget across these roles rather than following a fixed workflow. Conductor is a complementary framework that learns natural-language coordination strategies via reinforcement learning, generating custom instructions for how each sub-agent should communicate and what prior subtask context it should see. On benchmarks, Fugu Ultra scored 95.5% on GPQA Diamond (vs Claude Opus 4.8's 92.0%, GPT-5.5's 93.6%, Gemini 3.1 Pro's 94.3%), 73.7% on SWE-Bench Pro (vs Opus 4.8's 69.2%, GPT-5.5's 58.6%, Gemini 3.1 Pro's 54.2%), 93.2% on LiveCodeBench (vs Gemini 3.1 Pro's 88.5%), and 50.0% on Humanity's Last Exam (vs Opus 4.8's 49.8%). All scores are vendor-reported in Sakana's June 2026 technical report and have not been independently reproduced by third-party labs. The orchestration nature of Fugu means these scores reflect the capability of the whole coordinated system, not a set of single-model weights. Latency is the trade-off. On simple prompts, Fugu adds only a small orchestration overhead and behaves like a single fast model. On hard prompts, multiple models are working in parallel or sequence before any output reaches the user, and latency can range from 11 seconds to over four minutes depending on problem complexity. The standard Fugu tier is the right choice for interactive work (chat, code review, quick searches). Fugu Ultra is for batch-mode research, complex engineering, and multi-step analysis where answer quality matters more than response time. Fugu Ultra is priced at $5.00 per million input tokens and $30.00 per million output tokens for contexts up to 272,000 tokens. For contexts above 272K (up to the 1M maximum), the rate rises to $10.00 input and $45.00 output per million tokens, with cached input at $1.00 per million. The base Fugu tier charges at the underlying frontier model rate with no stacking: if multiple models are active, you pay the rate for the single most expensive model involved. Subscription plans are also available: Standard ($20/month), Pro ($100/month), and Max ($200/month), all including access to both Fugu and Fugu Ultra. The API is OpenAI-compatible and served at https://api.sakana.ai/v1. Users authenticate with a Sakana API key. The model pool can be customized per API key: when creating or editing a key in the console at console.sakana.ai, users can specify which providers Fugu is allowed to route to. This is the primary mechanism for sovereignty control, allowing Japan-based enterprises to exclude specific US-origin models if needed. There is currently no Fugu deployment on AWS Bedrock, Google Vertex AI, or Azure as of June 2026. Fugu was designed explicitly to operate without dependency on any single frontier model, specifically addressing the risk that Japan-based enterprises could lose access to Anthropic, OpenAI, or Google models under evolving US export controls. The underlying model pool is swappable without any API change, so if one provider restricts access, Sakana can substitute another. This sovereign positioning is unusual among orchestration systems and has been highlighted by Sakana as a core design goal rather than a side effect. Sakana Fugu is best suited for teams that need GPQA-class reasoning or coding quality but cannot tolerate single-model API risk, or for tasks that genuinely benefit from multi-model verification: research synthesis, legal analysis, scientific reasoning, and complex engineering. It is not suitable for real-time voice assistants (latency variance is too high), for teams that require fully on-premise or air-gapped deployment (the model pool connects externally), or for workloads where per-token cost must be minimized (Fugu Ultra at $30 output per 1M is on the expensive end of the frontier).
Pricing
Fugu Ultra standard tier: $5 input / $30 output / $0.50 cached per 1M tokens (up to 272K context). Extended context (272K to 1M): $10 / $45 / $1.00 per 1M tokens. Base Fugu tier charges the underlying model rate. Subscriptions: $20/month Standard, $100/month Pro, $200/month Max.
Key Features
- Single-endpoint multi-model orchestration: One OpenAI-compatible API call internally routes to the optimal combination of frontier LLMs via the TRINITY and Conductor frameworks, returning one synthesized answer.
- GPQA Diamond 95.5% (vendor-reported): On the graduate-level reasoning benchmark GPQA Diamond, Fugu Ultra scored 95.5% in Sakana's June 2026 technical report, ahead of Gemini 3.1 Pro (94.3%), GPT-5.5 (93.6%), and Claude Opus 4.8 (92.0%).
- 1,000,000-token extended context window: Fugu Ultra supports up to 1 million tokens of context at $10/$45 per 1M input/output tokens, with 272K available at the standard $5/$30 rate.
- Swappable sovereign model pool: Users can lock Fugu to a specific set of underlying providers via the Sakana console, excluding export-controlled US models if required by local regulation.
- No-stacking flat pricing on multi-model calls: When multiple models work together on a request, you pay the rate of the single most expensive model involved, not a sum of all models used.
Pros
- 95.5% GPQA Diamond (vendor-reported) matches or beats every frontier single-model alternative published as of June 2026.
- Sovereign model pool lets Japan-based enterprises restrict routing to non-export-controlled providers without any API code changes.
- Fully OpenAI-compatible API means zero migration cost for teams already using the OpenAI SDK.
Cons
- Latency variance is severe on hard tasks: 11 seconds to 4+ minutes before a response arrives.
- No independent benchmark verification as of June 2026; all scores are Sakana-reported only.
- No deployment on AWS Bedrock, Google Vertex, or Azure; API-only via Sakana's own endpoint.
Benchmarks
- gpqa diamond: 95.5
- swe bench pro: 73.7
- livecode bench: 93.2
- terminal bench: 82.1
- humanitys last exam: 50
Frequently Asked Questions
What is Sakana Fugu and who built it?
Sakana Fugu is a multi-agent orchestration model released June 22, 2026 by Sakana AI, a Tokyo-based startup founded in July 2023. Unlike a standard single-model LLM, Fugu is itself a trained language model that coordinates a pool of frontier LLMs: it decides whether to answer a query directly or break it into subtasks, delegate to specialist models, verify outputs, and synthesize one response behind a single OpenAI-compatible API endpoint. The system is built on two ICLR 2026 papers: TRINITY (which assigns Thinker, Worker, and Verifier roles to models in the pool) and Conductor (which learns natural-language coordination strategies via reinforcement learning). Fugu comes in two variants: Fugu (optimized for lower latency on everyday tasks) and Fugu Ultra (tuned for maximum accuracy on hard multi-step problems, with a 1,000,000-token context window). Fugu Ultra scored 95.5% on GPQA Diamond and 73.7% on SWE-Bench Pro in Sakana's June 2026 technical report; both scores are vendor-reported and not yet independently verified. The model is built by the same team behind The AI Scientist (published in Nature) and AB-MCTS (NeurIPS 2025 Spotlight).
How much does Sakana Fugu cost per 1M tokens?
Sakana Fugu Ultra is priced at $5.00 per million input tokens and $30.00 per million output tokens for contexts up to 272,000 tokens, with cached input at $0.50 per million tokens. For extended context requests (272K to 1,000,000 tokens), the rate rises to $10.00 input and $45.00 output per million tokens. The base Fugu tier does not have a flat rate: it charges at the rate of whichever underlying frontier model handles the request, and when multiple models are active, Sakana only charges for the most expensive one (no stacking). Subscription plans are available at $20/month (Standard), $100/month (Pro, roughly 10x the Standard usage allowance), and $200/month (Max, 20x Standard). A practical cost example: 10 Fugu Ultra requests averaging 5,000 input tokens and 2,000 output tokens each would cost roughly $0.31. A daily engineering loop of 1 million input and 200,000 output tokens via Fugu Ultra would run approximately $11. Compared to Claude Opus 4.8 ($5 input / $25 output per 1M tokens), Fugu Ultra's output rate ($30) is meaningfully higher.
What is Sakana Fugu's context window and max output?
Sakana Fugu Ultra supports a maximum context window of 1,000,000 tokens. The standard pricing tier covers contexts up to 272,000 tokens ($5/$30 per 1M input/output). Requests exceeding 272K tokens shift to the extended context rate ($10/$45 per 1M). The maximum output is 128,000 tokens. Compared to competing models, the 1M context window matches Claude Opus 4.8's maximum context, while GPT-5.5 and Gemini 3.1 Pro also offer large context options. Sakana has not published a needle-in-haystack or long-context recall eval for Fugu Ultra as of June 2026, so long-context recall quality at the full 1M limit has not been independently characterized. The standard Fugu tier's context window is not separately specified in available documentation and likely matches the context of the underlying model chosen for each request.
How does Sakana Fugu compare on benchmarks vs Claude Opus 4.8?
On GPQA Diamond (graduate-level reasoning), Fugu Ultra scored 95.5% vs Claude Opus 4.8's 92.0%, a 3.5-point gap in Fugu's favor. On SWE-Bench Pro (Sakana's engineering evaluation), Fugu Ultra scored 73.7% vs Opus 4.8's 69.2%, a 4.5-point gap. On Humanity's Last Exam, Fugu Ultra scored 50.0% vs Opus 4.8's 49.8%, essentially tied. On LiveCodeBench (competitive coding), Fugu scored 93.2% vs Gemini 3.1 Pro's 88.5%. However, all Fugu scores are reported in Sakana's own June 2026 technical report and have not been independently reproduced. SWE-Bench Pro and the standard SWE-bench Verified are different evaluations: Opus 4.8's published SWE-bench Verified score (88.6%) and Fugu's SWE-Bench Pro score (73.7%) are not directly comparable numbers. Claude Opus 4.8 also has independently verified benchmark results and lower latency, while Fugu Ultra is the faster-latency-variance choice only if multi-model coordination quality is more important than response-time predictability.
Is Sakana Fugu open source or proprietary?
Sakana Fugu is proprietary and API-only. There are no open weights, no Hugging Face repository, and no option to self-host or run the model air-gapped. The Fugu API is served at https://api.sakana.ai/v1 and is OpenAI-compatible, meaning existing OpenAI SDK code works with only a base URL and API key change. There is no deployment of Fugu on AWS Bedrock, Google Vertex AI, or Microsoft Azure as of June 2026. Sakana AI has not published a commercial license document separately from its standard terms; the license is proprietary. For teams that need open-weights models for on-premise deployment, alternatives include Llama 4 (Meta AI, open-weights) or Mistral AI's open-source models. The underlying models in Fugu's pool are third-party proprietary models; Sakana does not publish which models are in the pool by default.
What modalities does Sakana Fugu support?
Sakana Fugu supports text input and text output, plus tool-calls for function calling and structured outputs. The API is OpenAI-compatible and supports the standard messages format with system, user, and assistant roles. As of June 2026, Fugu does not accept image, audio, video, or PDF inputs natively; it is a text-in, text-out orchestration model. Structured output via tool-calls is supported, though Sakana has not published detailed documentation on supported tool schemas or parallel tool-call behavior. Code generation is a supported use case given the SWE-Bench Pro score, but there is no native code execution sandbox within the Fugu API itself. Compared to Claude Opus 4.8 (which supports vision and PDF inputs) or GPT-5.5 (which supports multimodal inputs), Fugu is text-only in its current form, which limits its applicability for workflows requiring image or document understanding.
Does Sakana Fugu train on user data?
Sakana AI has not published a data retention policy, privacy policy, or trust center as of June 2026. The company has not publicly confirmed whether API inputs are used for training future models, what the default data retention period is, or whether an enterprise zero-retention option exists. The API is based in Japan (ap-northeast-1 region), which may be relevant for Japanese data sovereignty requirements, but no formal data residency commitment has been documented. Sakana Fugu has not confirmed SOC 2 Type II, ISO 27001, HIPAA eligibility, or GDPR compliance certifications. Enterprise customers with strict data governance requirements should contact Sakana AI directly to request a data processing agreement (DPA) or security documentation before using the API in production. This is a notable gap compared to Anthropic (SOC 2 Type II, HIPAA, GDPR), OpenAI (SOC 2 Type II, GDPR), and Google Vertex (all major certs).
Who is Sakana Fugu best for and who should avoid it?
Sakana Fugu is best for enterprise teams in Japan needing export-control-resilient LLM routing (the swappable model pool can exclude US-restricted providers without API changes), research teams running complex multi-step reasoning where GPQA-class quality is needed, and engineering teams already using the OpenAI SDK who want a frontier-level coding alternative with a drop-in endpoint change. It is also suitable for batch workflows where response latency can exceed minutes. Teams that should avoid Fugu Ultra include real-time chat or voice applications (11-second to 4-minute latency variance is disqualifying), cost-sensitive high-volume API consumers ($30 output per 1M tokens is among the highest in the market), and enterprises requiring SOC 2, HIPAA, or GDPR certification before procurement (none confirmed as of June 2026). For latency-sensitive needs, Claude Opus 4.8 or GPT-5.5 are better choices. For cost-sensitive bulk inference, GPT-4o mini or Gemini Flash are significantly cheaper. For open-weights or self-hosting requirements, Llama 4 or Mistral Large are the right alternatives.