Mistral Medium 3 Review: 92% HumanEval, $0.40/M (2026)

Mistral Medium 3 (May 2025) scores 92.1% HumanEval, 77.2% MMLU-Pro with a 128K context window. Priced at $0.40/$2.00 per 1M tokens; superseded by Medium 3.5.

Mistral Medium 3 (mistral-medium-2505) is Mistral AI's May 7, 2025 flagship release, hitting 92.1% on HumanEval, 77.2% on MMLU-Pro, and 57.1% on GPQA Diamond with a 128K context window. At $0.40 per 1M input tokens and $2.00 per 1M output tokens, about 8x cheaper than Claude Sonnet 3.7, it has since been superseded by Medium 3.1 (August 2025) and Medium 3.5 (April 2026).

Mistral Medium 3, released by Mistral AI on May 7, 2025, scored 92.1% on HumanEval and 77.2% on MMLU-Pro with a 128K-token context window. Priced at $0.40 per million input tokens and $2.00 per million output tokens, roughly 8x cheaper than Claude Sonnet 3.7. It was superseded by Mistral Medium 3.5 in April 2026.

Provider: Mistral AI · Family: Mistral Medium

Context window: 128,000 tokens

Input modalities: text, image · Output: text

About Mistral Medium 3

Mistral Medium 3 (API name mistral-medium-2505) is a frontier-class multimodal model released by Mistral AI on May 7, 2025. It sits in the middle of Mistral's lineup, positioned below the flagship Large tier but designed to deliver most of the capability of much larger proprietary models at a fraction of the cost. Mistral has not disclosed the exact parameter count or whether the model uses a dense or mixture-of-experts architecture, but based on Mistral's naming conventions for the 2025-2026 generation, dense models occupy the Small and Medium tiers while Mixture-of-Experts is reserved for Large. Mistral pitched the launch under the tagline 'medium is the new large,' framing the model as proof that mid-sized models can now do work that previously required flagship-class compute. On benchmarks, Mistral Medium 3 scored 92.1% on HumanEval 0-shot, matching Claude Sonnet 3.7's reported score and outperforming Llama 4 Maverick at 85.4%. On Math500 Instruct 0-shot it reached 91.0%, ahead of GPT-4o's 76.4%. On MMLU-Pro 5-shot with chain-of-thought it scored 77.2%, again ahead of GPT-4o's 75.8%. On ArenaHard 0-shot it reached 97.1% versus GPT-4o's 95.4%. The model's weakest reported area is graduate-level reasoning: GPQA Diamond 5-shot CoT came in at 57.1%, well behind frontier reasoning-tuned models released in 2026. Mistral did not publish SWE-bench, AIME, or ARC-AGI scores at launch, a transparency gap that drew criticism from independent evaluators. The model ships with a 128,000-token context window, confirmed in Mistral's own model card. On RULER 128K, a long-context retrieval benchmark, it scored 0.902, ahead of GPT-4o's 0.889 at the same context length. Maximum output token limits were not separately published; output tokens count against the same 128K budget as input. Independent evaluation by Artificial Analysis recorded an output speed of 36.8 tokens per second and a time-to-first-token of 1.53 seconds, both measured against the hosted La Plateforme endpoint. Mistral Medium 3 accepts text and image input and produces text output. On DocVQA, a document visual question-answering benchmark, it scored 0.953, and on MMMU (multimodal multitask understanding) it scored 0.661. The model supports function calling and tool use, structured JSON outputs, fill-in-the-middle completions, document OCR and document Q&A, and Mistral's Agents and Conversations APIs. It is a non-reasoning model: there is no extended-thinking or configurable reasoning-effort mode, which distinguishes it from Mistral Medium 3.5's later 'reasoning effort' parameter. Pricing is $0.40 per million input tokens and $2.00 per million output tokens, which Mistral marketed as an 8x cost reduction versus Claude Sonnet 3.7's $3.00/$15.00 per million tokens at the time of launch. Artificial Analysis lists a blended price of roughly $0.56 per million tokens using a 7:2:1 input:output:cached weighting. A document-heavy workload processing 500K input tokens and generating 50K output tokens per day would cost roughly $0.30/day; a coding agent loop running 2M input and 400K output tokens per day would cost around $1.60/day, both an order of magnitude cheaper than equivalent Claude Sonnet 3.7 usage. Mistral Medium 3 is available via Mistral's La Plateforme API and Amazon SageMaker at launch, with Amazon Bedrock, Azure AI Foundry, Google Cloud Vertex AI, IBM watsonx, and NVIDIA NIM listed as additional or follow-on deployment targets. For enterprises that need on-premises or VPC deployment, Mistral states the model can be self-hosted on four GPUs and above, and supports continuous pretraining and fine-tuning for domain adaptation. Mistral AI is headquartered in Paris and operates under EU data protection law by default, which gives Mistral Medium 3 a GDPR-aligned baseline that some US-based competitors require an enterprise tier to match. Mistral's published policy is that API inputs are not used to train future models unless a customer opts in. Mistral has not published a dedicated system card with HarmBench or jailbreak-resistance figures for Medium 3 specifically, and safety evaluation details for this model are thinner than for Mistral's later releases. Mistral Medium 3 is the right choice for teams building cost-sensitive coding assistants, document-processing pipelines that need OCR plus long-context retrieval, and enterprises that want a capable multimodal model they can run in their own VPC on a handful of GPUs. It is the wrong choice for anything that depends on frontier-level scientific reasoning (GPQA Diamond 57.1% is a real gap), for latency-sensitive real-time chat given its sub-median 36.8 tokens/second output speed, and for any new project expecting long-term API stability, since Mistral has already shipped two successors. The model line moved quickly after this release: Mistral Medium 3.1 (mistral-medium-2508) shipped just three months later in August 2025 as a refreshed frontier-class multimodal model, and Mistral Medium 3.5 (mistral-medium-3-5) followed in April 2026 as a 128B dense model with a 256K context window, native reasoning-effort controls, and stronger agentic coding scores. Mistral Medium 3 has a published deprecation timeline around May 2026 on at least one cloud listing, so new integrations should target Medium 3.5 while existing mistral-medium-2505 deployments plan a migration.

Pricing

$0.40 per 1M input tokens, $2.00 per 1M output tokens on Mistral La Plateforme, roughly an 8x discount versus Claude Sonnet 3.7's $3/$15 per 1M at launch. Artificial Analysis lists a blended price near $0.56 per 1M tokens.

Key Features

Pros

Cons

Benchmarks

Frequently Asked Questions

What is Mistral Medium 3 and who built it?

Mistral Medium 3 (API name mistral-medium-2505) is a frontier-class multimodal model built by Mistral AI, a Paris-based lab, and released on May 7, 2025. It sits in the middle of Mistral's lineup, below the Large tier, but was marketed under the tagline 'medium is the new large' for delivering near-flagship performance at a fraction of the cost. Mistral has not disclosed the parameter count or whether the architecture is dense or mixture-of-experts. On launch benchmarks it scored 92.1% on HumanEval, 77.2% on MMLU-Pro, and 91.0% on Math500 Instruct, putting it ahead of GPT-4o and roughly on par with Claude Sonnet 3.7 on coding. It was designed to compete with mid-to-large proprietary models like GPT-4o, Claude Sonnet 3.7, and Llama 4 Maverick while costing a fraction as much per token. Mistral positioned it as the model that brings flagship coding and document-understanding ability into a deployable, self-hostable package.

How much does Mistral Medium 3 cost per 1M tokens?

Mistral Medium 3 costs $0.40 per 1 million input tokens and $2.00 per 1 million output tokens on Mistral's La Plateforme API, Amazon Bedrock, and Azure AI Foundry. Mistral marketed this as an 8x cost reduction versus Claude Sonnet 3.7, which costs $3.00 input and $15.00 output per 1 million tokens. Artificial Analysis lists a blended price of roughly $0.56 per 1 million tokens using a typical 7:2:1 input:output:cache ratio. A document OCR and QA pipeline processing 500K input and 50K output tokens per day would cost about $0.30 per day. A coding agent loop running 2M input and 400K output tokens per day would cost roughly $1.60 per day. No cached-input discount is published for this model. For teams that can self-host, the model runs on four GPUs and above, trading the per-token fee for infrastructure cost.

What is Mistral Medium 3's context window and max output?

Mistral Medium 3 has a 128,000-token context window, confirmed in Mistral's official model card. A separate maximum output token limit is not published; output tokens share the same 128K budget as input tokens. On RULER 128K, a long-context retrieval benchmark, the model scored 0.902, ahead of GPT-4o's reported 0.889 at the same context length, indicating reliable recall near the top of its context window. There is no documented sliding-window behavior or separate extended-context tier for this model. For multi-document workloads, Mistral recommends chunking very large files to leave headroom for output generation within the 128K budget. Compared to Medium 3.5's later 256K window, Medium 3 offers half the context but was the largest in its tier at launch in May 2025.

How does Mistral Medium 3 compare on benchmarks vs Claude Sonnet 3.7?

On HumanEval, Mistral Medium 3 scored 92.1%, matching Claude Sonnet 3.7's reported score on the same benchmark. On Math500 Instruct, Medium 3 scored 91.0%, and on MMLU-Pro it scored 77.2%, both competitive with Sonnet 3.7-class models. However, on GPQA Diamond, a graduate-level science reasoning benchmark, Medium 3 scored only 57.1%, a notable gap versus frontier reasoning-tuned models released later in 2025 and 2026. Mistral did not publish SWE-bench Verified, AIME, or ARC-AGI scores for Medium 3, while Anthropic has published SWE-bench numbers for Claude models, making a direct agentic-coding comparison impossible from public data alone. In practice, the 35-point GPQA gap means Medium 3 is reliable for coding and document tasks but more likely to make mistakes on multi-step scientific or logical reasoning chains than reasoning-focused competitors. The headline result is that Medium 3 matches Sonnet 3.7 on raw coding pass rates at about 1/8th the price, but trails on hard reasoning.

Is Mistral Medium 3 open source or proprietary?

Mistral Medium 3 is proprietary and API-only; its weights are not published. It is licensed under Mistral's Commercial License and accessed via Mistral's La Plateforme API, Amazon Bedrock, Amazon SageMaker, and Azure AI Foundry, with Google Cloud Vertex AI and IBM watsonx listed as additional deployment targets. For organizations needing on-premises or VPC control, Mistral states the model can be self-hosted on four GPUs and above, with support for continuous pretraining and fine-tuning, but this still requires a commercial agreement with Mistral rather than an open download. This differs from some other models in the Mistral 3 family (such as the Apache 2.0-licensed Mistral 3 small/dense models and the open-weight Mistral Medium 3.5, released under a modified MIT license restricting commercial use) which are downloadable from Hugging Face. There is no commercial-use-free path to run Mistral Medium 3 itself.

What modalities does Mistral Medium 3 support?

Mistral Medium 3 accepts text and image input and produces text output. On DocVQA, a document visual question-answering benchmark, it scored 0.953, and on MMMU, a multimodal multitask understanding benchmark, it scored 0.661, both indicating solid image and document comprehension. The model supports function calling and tool use with structured JSON output, fill-in-the-middle code completions, document OCR, document Q&A, and Mistral's Agents and Conversations APIs for multi-step agentic workflows. It does not support audio input or output, or video input; those modalities are handled by separate Mistral models such as Voxtral. Compared to GPT-4o, which supports native audio in the same model, Medium 3 is text-and-image only, so voice applications require pairing it with a separate transcription or speech model.

Does Mistral Medium 3 train on user data?

Mistral's published policy states that API inputs and outputs sent to Mistral Medium 3 are not used to train future models unless a customer explicitly opts in. Mistral AI is headquartered in Paris and operates under EU data protection law by default, giving the model a GDPR-aligned baseline; data_residency_options for the standard API are EU-based. Mistral has not published SOC 2 Type II or ISO 27001 certification details specific to this model's hosting, and the model is not marketed as HIPAA-eligible. Under the EU AI Act, Mistral Medium 3 falls under the general-purpose AI model (GPAI) category, which carries documentation and transparency obligations for the provider. On Amazon Bedrock or Azure AI Foundry, data handling follows each cloud provider's standard tenant-isolation and retention policies layered on top of Mistral's base no-training commitment. No dedicated system card with HarmBench or jailbreak-resistance scores has been published for this specific model.

Who is Mistral Medium 3 best for and who should avoid it?

Mistral Medium 3 is best for engineering teams building cost-sensitive coding assistants who need near-Claude-Sonnet-3.7 HumanEval performance at roughly 1/8th the price, document-processing pipelines that need OCR plus long-context retrieval (DocVQA 0.953, RULER 128K 0.902), and enterprises that want a multimodal model they can self-host on four-plus GPUs for VPC or on-premises requirements. It should be avoided for graduate-level scientific or multi-step logical reasoning, where its 57.1% GPQA Diamond score trails reasoning-tuned competitors; teams should look at Medium 3.5 or a dedicated reasoning model instead. It's also a poor fit for latency-sensitive real-time chat, since Artificial Analysis measured 36.8 tokens/sec output speed against a roughly 94.5 tok/s median for comparable models. Finally, any new long-term integration should target Mistral Medium 3.5 rather than Medium 3, since Medium 3 has already been superseded twice (Medium 3.1 in August 2025, Medium 3.5 in April 2026) and carries a published deprecation timeline around May 2026.

Visit Mistral Medium 3 Official Page