MAI-Image-2.5: Microsoft's #2 Arena Image Editor (2026)

MAI-Image-2.5 is Microsoft AI's 2026 diffusion model, ranked #2 for image editing, #3 text-to-image on Arena. Priced from $19.50 per 1M output tokens.

MAI-Image-2.5 is Microsoft AI's diffusion image model, released June 2, 2026. On Arena it ranks #2 for image editing (Elo ~1,401) and #3 for text-to-image (Elo ~1,269), a 75-point composite gain over MAI-Image-2. The standard SKU costs $5 input / $47 output per 1M tokens; the Flash SKU costs $1.75 / $19.50 and is billed as about 41% cheaper than MAI-Image-2 and 22% faster. The model runs on Microsoft Foundry and OpenRouter and ships natively inside PowerPoint.

Provider: Microsoft AI · Family: MAI-Image

Context window: 32,000 tokens

Input modalities: text, image · Output: image

About MAI-Image-2.5

MAI-Image-2.5 is a diffusion-based text-to-image and image-editing model built by Microsoft AI, the in-house model group led by Mustafa Suleyman that Microsoft formed in November 2025 to reduce its reliance on OpenAI. Microsoft announced the model on June 2, 2026 at Build 2026, alongside a batch of seven new MAI models covering voice, transcription, coding and reasoning. MAI-Image-2.5 is the fifth release in Microsoft's image lineage, following MAI-Image-1 (October 2025), MAI-Image-2 (March 2026) and the cost-optimized MAI-Image-2-Efficient, and it ships in two SKUs: the full MAI-Image-2.5 and a faster, cheaper MAI-Image-2.5-Flash variant.

On the Arena leaderboard, the industry's crowd-voted comparison of image models, MAI-Image-2.5 placed #3 globally for text-to-image generation with an Elo near 1,269, putting it roughly level with Google's Nano Banana 2. On the separate Arena image-editing leaderboard it placed #2 with an Elo near 1,401, behind GPT-Image-2 but ahead of Nano Banana 2.1, DALL-E 3 and Ideogram 2.0. Microsoft reports a composite Arena gain of 75 points over the previous MAI-Image-2, with the two biggest category jumps in text rendering (+107 points) and cartoon, anime and fantasy styling (+90 points). No FID scores, CLIP scores, or head-to-head numbers against FLUX, Stable Diffusion 3 or Imagen 3/4 have been published, so those comparisons cannot be made with confidence.

The model accepts up to 32,000 tokens of text context per request and natively outputs images up to 1024x1024 pixels across seven aspect ratios (1:1, 4:3, 3:4, 16:9, 9:16, 3:2, 2:3). MAI-Image-2.5-Flash is Microsoft's first model in the line to handle both text-to-image and image-to-image workloads in one SKU, supporting localized edits such as replacing a single object, updating in-image text or removing motion blur without touching the rest of the frame.

Third-party analysis estimates 10 billion to 50 billion non-embedding parameters, but Microsoft has not disclosed an official parameter count, so that figure should be read as an estimate rather than a confirmed spec.

Access runs through Microsoft Foundry (formerly Azure AI Foundry) as the primary developer API, the MAI Playground at playground.microsoft.ai, and PowerPoint's built-in image generation and editing tools, with OneDrive integration rolling out. Third-party routers OpenRouter, Fireworks AI and Baseten also carry the model.

Pricing on Foundry is token-based. The standard MAI-Image-2.5 costs $5 per 1M text-input tokens, $8 per 1M image-input tokens and $47 per 1M image-output tokens. MAI-Image-2.5-Flash costs $1.75 per 1M text-input tokens, $1.75 per 1M image-input tokens and $19.50 per 1M image-output tokens, roughly 41% below MAI-Image-2. Enterprise customers can also reserve provisioned throughput (PTU) capacity, though Microsoft has not published PTU rates.

Microsoft's published model cards for the MAI-Image line describe a two-phase safety evaluation (pre-mitigation and post-mitigation) with prompt- and output-level filtering against violent or gory content, sexual content and nudity, depictions of real public figures, and reproduction of trademarked or protected material. The model cards explicitly recommend human review before using generated images in identity, medical, legal, financial or news contexts, rather than deploying the model autonomously in those workflows. Microsoft Ireland Operations Limited is listed as the EU regulatory contact on the model card. Whether MAI-Image-2.5's API output embeds C2PA content-provenance metadata specifically, versus Microsoft's broader February 2026 rollout of C2PA in Microsoft 365, is not spelled out in the public model card and should be treated as unconfirmed.

MAI-Image-2.5 is a strong fit for teams already inside the Microsoft ecosystem: PowerPoint and OneDrive users doing in-document image edits, Azure customers who want commercial imagery and packaging mockups with reliable in-image text, and developers who want a #2-ranked editing model at a lower per-image cost than GPT-Image-2.

It is a weaker fit for solo creators or hobbyists who want a one-click consumer app, since the primary access path runs through Microsoft Foundry's developer tooling rather than a standalone consumer interface comparable to Midjourney or DALL-E 3's ChatGPT integration; Bing Image Creator and Copilot rollout for the 2.5 generation was not confirmed as live at launch. Teams needing the single top-ranked text-to-image model on Arena, rather than a top-three placement, should also check GPT-Image-2 and the unnamed Arena leader before committing.

Microsoft has not published a training data cutoff or a detailed description of the training corpus for MAI-Image-2.5. It states only that the model was trained with curated data selection and evaluation feedback from creative professionals, with content filtering mitigations applied before training. No architecture paper or technical report accompanies the release, and no GitHub repository exists for the MAI-Image line, consistent with Microsoft's closed-weights approach across the MAI family.

The release fits a clear strategic pattern: Microsoft restructured its OpenAI partnership in October 2025 to gain the contractual right to pursue frontier model development independently, then stood up the Microsoft AI Superintelligence team the following month. MAI-Image-2.5's Foundry pricing, materially cheaper than licensing GPT-Image-2 at scale, positions it as the image-generation leg of that in-house stack alongside MAI-Voice-2 for speech and MAI-Thinking-1 for reasoning.

Pricing

Microsoft Foundry bills MAI-Image-2.5 per token: $5 per 1M text-input tokens, $8 per 1M image-input tokens and $47 per 1M image-output tokens. The Flash SKU is cheaper: $1.75 per 1M text-input tokens, $1.75 per 1M image-input tokens and $19.50 per 1M image-output tokens, about 41% below MAI-Image-2. Provisioned throughput (PTU) reservations exist for enterprise workloads, but rates are not publicly disclosed.
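Because billing is split across three token types, a quick estimator helps when budgeting a job. The following is a minimal sketch that only encodes the rates quoted above; the SKU keys ("standard", "flash") and the function name are illustrative labels, not official identifiers, and real invoices may include charges this page does not describe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """USD per 1M tokens, as quoted in the pricing text on this page."""
    text_in: float
    image_in: float
    image_out: float


# Keys are illustrative labels (an assumption), not official SKU identifiers.
RATES = {
    "standard": Rates(text_in=5.00, image_in=8.00, image_out=47.00),
    "flash": Rates(text_in=1.75, image_in=1.75, image_out=19.50),
}


def estimate_cost(sku, text_in_tokens=0, image_in_tokens=0, image_out_tokens=0):
    """Return the estimated USD cost for the given token counts."""
    r = RATES[sku]
    total = (
        text_in_tokens * r.text_in
        + image_in_tokens * r.image_in
        + image_out_tokens * r.image_out
    ) / 1_000_000
    return round(total, 4)


if __name__ == "__main__":
    # 1M output tokens on the standard SKU.
    print(estimate_cost("standard", image_out_tokens=1_000_000))
    # A mixed edit workload on Flash (token counts are made up for illustration).
    print(estimate_cost("flash", text_in_tokens=200_000,
                        image_in_tokens=500_000, image_out_tokens=2_000_000))
```

Output tokens dominate the bill in both SKUs, so the output-token estimate is the number worth getting right first.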

Frequently Asked Questions

What is MAI-Image-2.5 and who built it?

MAI-Image-2.5 is a diffusion-based text-to-image and image-editing model built by Microsoft AI, the in-house model group Microsoft formed in November 2025 under CEO Mustafa Suleyman to reduce its dependence on OpenAI. Microsoft announced it on June 2, 2026 at Build 2026, alongside seven other new MAI models covering voice, transcription, coding and reasoning. It is the fifth release in Microsoft's image-model lineage, following MAI-Image-1 (October 2025), MAI-Image-2 (March 2026) and MAI-Image-2-Efficient. The model ships in two SKUs: the full MAI-Image-2.5 and the faster, cheaper MAI-Image-2.5-Flash. On Arena, the crowd-voted image model leaderboard, it ranks #3 for text-to-image (Elo ~1,269) and #2 for image editing (Elo ~1,401), a 75-point composite gain over MAI-Image-2. It was designed specifically to compete with GPT-Image-2 and Google's Nano Banana line while giving Microsoft an owned alternative to licensing OpenAI's image models. No architecture paper has been published, and Microsoft has not disclosed an official parameter count.

How much does MAI-Image-2.5 cost per 1M tokens?

On Microsoft Foundry, the standard MAI-Image-2.5 SKU costs $5.00 per 1M text-input tokens, $8.00 per 1M image-input tokens, and $47.00 per 1M image-output tokens. The Flash SKU is cheaper across the board: $1.75 per 1M text-input tokens, $1.75 per 1M image-input tokens, and $19.50 per 1M image-output tokens, roughly 41% below MAI-Image-2's pricing. Generating 100 marketing images (about 1M output tokens) costs roughly $47 on the standard SKU or $19.50 on Flash. A larger Flash batch-edit job that emits about 2M output tokens runs about $39. By comparison, GPT-Image-2 access typically costs more per image at equivalent quality tiers, which is part of Microsoft's cost pitch for the MAI line. Enterprise customers can reserve provisioned throughput (PTU) capacity for predictable workloads, though Microsoft has not published PTU rates. There is no flat per-image consumer price; all billing runs through the token-based Foundry API.
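Since there is no flat per-image price, a per-image figure has to be derived from an assumed token count per image. The sketch below takes the "100 images is about 1M output tokens" figure from the answer above as a working assumption of 10,000 output tokens per image; that ratio is not a published spec, and input-token charges are ignored for simplicity.

```python
# Assumption: ~10,000 output tokens per image, derived from the
# "100 images ~ 1M output tokens" example; not a published figure.
ASSUMED_OUTPUT_TOKENS_PER_IMAGE = 10_000

# USD per 1M image-output tokens, as quoted on this page.
OUTPUT_RATE_USD_PER_1M = {"standard": 47.00, "flash": 19.50}


def batch_output_cost(sku, images, tokens_per_image=ASSUMED_OUTPUT_TOKENS_PER_IMAGE):
    """Output-token cost in USD for a batch of images (input tokens ignored)."""
    tokens = images * tokens_per_image
    return tokens * OUTPUT_RATE_USD_PER_1M[sku] / 1_000_000


def flash_saving_pct():
    """Percent saved on output tokens by choosing Flash over the standard SKU."""
    std = OUTPUT_RATE_USD_PER_1M["standard"]
    flash = OUTPUT_RATE_USD_PER_1M["flash"]
    return 100 * (std - flash) / std


if __name__ == "__main__":
    for sku in OUTPUT_RATE_USD_PER_1M:
        per_image = batch_output_cost(sku, 1)
        print(f"{sku}: ${per_image:.3f} per image, "
              f"${batch_output_cost(sku, 100):.2f} per 100 images")
    print(f"Flash vs standard output saving: {flash_saving_pct():.1f}%")
```

Note that the saving computed here compares Flash with the standard 2.5 SKU, which is a different baseline from the roughly 41% figure quoted against the older MAI-Image-2.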

What is MAI-Image-2.5's context window and output resolution?

MAI-Image-2.5 accepts up to 32,000 tokens of text context per request, enough for detailed prompts, style references and multi-step edit instructions in a single call. It natively outputs images up to 1024x1024 pixels across seven supported aspect ratios: 1:1, 4:3, 3:4, 16:9, 9:16, 3:2 and 2:3. Microsoft has not published a separate high-resolution or upscaling tier for MAI-Image-2.5, so 1024x1024 should be treated as the model's native output ceiling rather than a minimum. There is no long-context recall metric to report since this is an image-output model, not a text-generation model; the 32K context figure governs the prompt and edit instructions the model can process, not conversational memory. For document or multi-file inputs, Microsoft has not documented a dedicated PDF or multi-image batch mode beyond single-image editing calls.
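The documented limits can be checked client-side before a request is sent. The sketch below encodes only the 32,000-token ceiling, the 1024-pixel ceiling and the seven aspect ratios stated above. The per-ratio pixel sizes are an assumption: they fix the long side at 1024 and snap the short side down to a multiple of 8, a common convention for diffusion models that Microsoft has not confirmed for this one. Token counting is left to the caller because the tokenizer is not documented.

```python
MAX_PROMPT_TOKENS = 32_000   # stated context limit
MAX_SIDE_PX = 1024           # stated native output ceiling

# The seven aspect ratios listed on this page, as (width, height) parts.
ASPECT_RATIOS = {
    "1:1": (1, 1), "4:3": (4, 3), "3:4": (3, 4), "16:9": (16, 9),
    "9:16": (9, 16), "3:2": (3, 2), "2:3": (2, 3),
}


def check_prompt_length(token_count):
    """Raise if the prompt exceeds the stated 32,000-token context limit."""
    if token_count > MAX_PROMPT_TOKENS:
        raise ValueError(
            f"prompt has {token_count} tokens; limit is {MAX_PROMPT_TOKENS}"
        )


def assumed_output_size(ratio, multiple=8):
    """Return an assumed (width, height) for a supported aspect ratio.

    Assumption: long side = 1024, short side rounded down to `multiple`.
    The real dimensions per ratio are not published on this page.
    """
    if ratio not in ASPECT_RATIOS:
        raise ValueError(f"unsupported aspect ratio: {ratio}")
    w, h = ASPECT_RATIOS[ratio]
    # Integer arithmetic avoids floating-point rounding surprises.
    short = (MAX_SIDE_PX * min(w, h) // max(w, h)) // multiple * multiple
    return (MAX_SIDE_PX, short) if w >= h else (short, MAX_SIDE_PX)


if __name__ == "__main__":
    check_prompt_length(1_200)
    for name in ASPECT_RATIOS:
        print(name, assumed_output_size(name))
```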

How does MAI-Image-2.5 compare on benchmarks vs GPT-Image-2?

On Arena's image-editing leaderboard, GPT-Image-2 holds the #1 spot with MAI-Image-2.5 close behind at #2 (Elo ~1,401), ahead of Google's Nano Banana 2.1, DALL-E 3 and Ideogram 2.0. On the text-to-image leaderboard, MAI-Image-2.5 ranks #3 at Elo ~1,269, level with Google's Nano Banana 2 but behind GPT-Image-2 and the leaderboard's top entry. In practice, the #2-versus-#1 gap on editing means MAI-Image-2.5 is competitive but not the clear leader for pure edit-quality workloads, while its pricing (as low as $19.50 per 1M output tokens on Flash) undercuts typical GPT-Image-2 access costs. Neither Microsoft nor independent researchers have published FID scores, CLIP scores, or other standardized metrics comparing the two models directly, so the Arena Elo comparison is the only verifiable head-to-head benchmark available as of this writing. Teams should run their own task-specific evaluation before assuming Arena rank translates directly to their use case.
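An Elo gap is easier to reason about as an expected preference rate. The sketch below uses the standard Elo logistic formula with a 400-point scale; whether Arena fits its ratings with exactly this formula is an assumption, so treat the result as a rough guide. Only rating differences within one leaderboard are meaningful, so the text-to-image (~1,269) and editing (~1,401) scores should not be compared with each other.

```python
def expected_win_rate(rating_a, rating_b, scale=400.0):
    """Expected share of pairwise votes won by A under the standard Elo model.

    Assumption: the conventional base-10 logistic with a 400-point scale.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


if __name__ == "__main__":
    # The 75-point composite gain Microsoft reports over MAI-Image-2,
    # expressed as a head-to-head preference rate (only the gap matters).
    gain = 75
    rate = expected_win_rate(gain, 0)
    print(f"A {gain}-point gap implies roughly {rate:.1%} of votes")
```

Under this model a 75-point gain corresponds to winning roughly three out of five pairwise comparisons against the previous model: a clear improvement, though the older model would still be preferred in a sizable minority of matchups.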

Is MAI-Image-2.5 open source or proprietary?

MAI-Image-2.5 is fully proprietary. Microsoft has not released model weights, an architecture paper, or a GitHub repository for any model in the MAI-Image line. Access is limited to hosted surfaces: Microsoft's own (the Microsoft Foundry API, formerly Azure AI Foundry; the MAI Playground; and PowerPoint's built-in image tools) plus the third-party routers OpenRouter, Fireworks AI and Baseten, which resell API access under Microsoft's terms. There is no open license to cite, permissive or restrictive, no Hugging Face listing, and no self-hosting option; VRAM and quantization questions do not apply since the weights are never distributed. Commercial use is governed entirely by the Microsoft Foundry Model Terms tied to an Azure subscription, not a separate open-source license.

What modalities does MAI-Image-2.5 support?

MAI-Image-2.5 accepts text prompts and image inputs, and produces image outputs only; it does not generate or accept audio or video. Its defining new capability versus earlier MAI-Image releases is image-to-image editing: given an existing image plus a text instruction, it can replace a single object, update in-image text, or remove motion blur while leaving the rest of the frame untouched, a feature Microsoft calls surgical editing. Text-to-image generation from a prompt alone remains fully supported, as it was in MAI-Image-1 and MAI-Image-2. The model does not support function calling, structured JSON output, or tool use in the way LLMs do, since its sole output modality is a rendered image. There is no confirmed computer-use or agentic-loop capability for this model; it functions as a single-turn image generation and editing endpoint.

Does MAI-Image-2.5 train on user data?

Microsoft has not published a MAI-Image-2.5-specific data retention or training-on-inputs policy beyond its general Microsoft Foundry data handling terms, so this should be treated as unconfirmed rather than assumed. The model's published model card focuses on safety evaluation (a two-phase pre-mitigation and post-mitigation process) rather than data retention specifics. Microsoft's model card lists Microsoft Ireland Operations Limited as the EU regulatory contact, suggesting standard Microsoft compliance infrastructure applies, but no SOC 2, ISO 27001, HIPAA, or GDPR compliance statement specific to this model has been located in public sources. Microsoft began adding C2PA content-provenance metadata to Microsoft 365 content in February 2026, but whether MAI-Image-2.5's API output specifically carries C2PA manifests is not explicitly confirmed in the public model card. Enterprises with strict data-handling requirements should confirm retention and training terms directly with their Microsoft Foundry account team before relying on assumptions.

Who is MAI-Image-2.5 best for and who should avoid it?

MAI-Image-2.5 is best for three groups: Microsoft 365 teams doing in-document image generation and editing directly inside PowerPoint, with OneDrive integration rolling out; Azure developers building commercial and packaging imagery where in-image text quality matters, since text rendering improved 107 Arena points over MAI-Image-2; and cost-sensitive teams running high-volume batch edits on the Flash SKU at $19.50 per 1M output tokens. It is a poor fit for independent creators or hobbyists who want a one-click consumer app, since the primary access path runs through Microsoft Foundry developer tooling rather than a standalone app comparable to Midjourney or DALL-E 3 in ChatGPT, and Bing Image Creator and Copilot rollout for the 2.5 generation was not confirmed live at launch. Teams that specifically need the single top-ranked Arena text-to-image model, rather than a top-three placement, should evaluate GPT-Image-2 first. Researchers or procurement teams that require a disclosed architecture paper, parameter count, or training data cutoff will also find MAI-Image-2.5's documentation thinner than some competitors.
