Phi-4: Microsoft's 14B Open-Source SLM, 84.8% MMLU (2025)
Microsoft's Phi-4 is a 14B open-source SLM (MIT license) scoring 84.8% MMLU and 56.1% GPQA Diamond. Available at $0.065/1M input tokens on Azure AI Foundry.
Phi-4 is Microsoft's 14-billion parameter small language model, released December 12, 2024, scoring 84.8% MMLU and 56.1% GPQA Diamond, outperforming GPT-4o on both benchmarks despite being a 14B model. Under the MIT license and available at $0.065 per 1M input and $0.140 per 1M output tokens via Azure AI Foundry, it runs locally on 12 GB VRAM using Q4_K_M quantization.
Phi-4 is a 14-billion parameter open-source small language model from Microsoft, released December 2024 under the MIT license. It scores 84.8% on MMLU and 56.1% on GPQA Diamond, outperforming GPT-4o on those two benchmarks despite being significantly smaller. Available via Azure AI Foundry at $0.065 per 1 million input tokens and $0.140 per 1 million output tokens, it runs locally on 12 GB of VRAM with Q4 quantization.
Provider: Microsoft · Family: Phi-4
Context window: 16,384 tokens · Max output: 16,384
Input modalities: text · Output: text
About Phi-4
Phi-4 is a 14-billion parameter small language model from Microsoft Research, first released December 12, 2024 via Azure AI Foundry and made publicly available on Hugging Face on January 9, 2025. The architecture is a dense Transformer with no mixture-of-experts routing, fine-tuned for instruction following using supervised fine-tuning (SFT) and direct preference optimization (DPO). Microsoft designed the model to target graduate-level academic reasoning tasks typically associated with much larger models, training it on a curated pipeline of high-quality synthetic data generated by stronger models, filtered public web text, and licensed academic and educational sources. Within the Phi family, Phi-4 at 14B is the largest instruct model; subsequent variants released in 2025 include Phi-4-mini (3.8B, February 2025), Phi-4-multimodal (5.6B, March 2025), and Phi-4-reasoning (14B chain-of-thought, April 2025). Microsoft's December 2024 technical report places Phi-4 at 84.8% on MMLU (academic multitask), 56.1% on GPQA Diamond (graduate-level STEM), 82.6% on HumanEval (Python coding), and 80.4% on MATH (competition mathematics). On GPQA and MATH specifically, Phi-4 outperforms GPT-4o despite having roughly 90 billion fewer parameters, a result Microsoft attributes to the quality of synthetic training data rather than raw parameter scale. Against Qwen-2.5-14B-Instruct, the closest competing 14B instruct model at release, Phi-4 wins on 9 of 12 benchmarks in the technical report comparison table. On Artificial Analysis's composite Intelligence Index, Phi-4 scores 10 against a peer average of 12, indicating it is optimized for narrow academic reasoning rather than broad general-purpose intelligence. The model generates output at 26.2 tokens per second on Azure and 54.8 tokens per second on DeepInfra, below average for non-reasoning models at its size tier. Phi-4's context window is 16,384 tokens (approximately 12,000 words), with a maximum output length also capped at 16,384 tokens. This is a significant constraint relative to modern alternatives: GPT-4o Mini supports 128K tokens, Phi-4-mini-instruct supports 128K tokens, and Gemini 2.0 Flash supports 1 million tokens. There is no extended-context API tier, no sliding window mechanism, and no option to increase the context limit for Phi-4 base. For single-turn reasoning tasks, coding sessions, short document summaries, and question-answering on excerpts, the 16K window is sufficient. Teams processing full research papers, building multi-turn chat applications with long conversation histories, or running RAG pipelines with large document chunks will hit the ceiling routinely. Phi-4 processes text input only and generates text output only; no vision, audio, video, or PDF input is supported in the base model. Native function calling via a structured tool-call API is not available in base Phi-4; Microsoft added function calling support in Phi-4-mini and Phi-4-multimodal, not in the 14B base. Structured JSON output must be prompted manually with explicit format instructions rather than through a dedicated API parameter. The Phi-4-multimodal-instruct model (5.6B, March 2025) is the family member with vision, audio, and function calling, using a Mixture-of-LoRAs technique to handle all three modalities through a shared frozen base model. For agentic workflows or tool-use pipelines, the base Phi-4 is not a drop-in replacement for GPT-4o or Claude without significant additional engineering. Via Azure AI Foundry, Phi-4 costs $0.065 per 1 million input tokens and $0.140 per 1 million output tokens, placing it among the cheapest managed API options for a capable 14B instruct model. For a workload generating 1 million output tokens per day (about 1,400 average-length chat responses at 700 tokens each), daily API cost via Azure is $0.14. DeepInfra offers comparable or lower pricing with faster throughput at 54.8 tokens per second, and OpenRouter aggregates multiple providers with similar rates. Self-hosting via Ollama or vLLM eliminates per-token costs entirely; with Q4_K_M quantization the full 14B model fits in 8.30 GB of VRAM, meaning a single RTX 3080 or RTX 4080 can serve the model at no marginal inference cost beyond hardware. GitHub Models offers free access for individual developers with rate limits. Phi-4 weights are freely downloadable from Hugging Face at microsoft/phi-4 under the MIT license, in SafeTensors format for direct loading with the transformers library. GGUF-quantized variants covering Q4_0, Q4_K_M, Q5_K_M, Q8_0, and W8A8 formats are available from the bartowski and RedHatAI Hugging Face repositories, compatible with Ollama, LM Studio, llama.cpp, and vLLM. For managed cloud access, Azure AI Foundry provides serverless API access with pay-as-you-go pricing, and the NVIDIA NGC Catalog also hosts the model for deployment via NIM. VRAM requirements for self-hosted deployment are approximately 8.30 GB for Q4_K_M, 10 GB for Q5_K_M, and 14+ GB for Q8_0. The model is also accessible via GitHub Models at no cost for development and testing. Microsoft aligned Phi-4 using SFT combined with iterative DPO, drawing on publicly available safety datasets focused on helpfulness and harmlessness plus in-house synthetic datasets targeting specific content safety categories. Before release, the model underwent adversarial evaluation using Microsoft's internal AI Red Team (AIRT), which ran qualitative safety assessments for both average-user and adversarial scenarios. Microsoft's Responsible AI standards govern the model's content restrictions; Phi-4 declines requests for CSAM, weapons of mass destruction information, and similar clearly harmful content. The model does not train on user API inputs; data retention for abuse monitoring on Azure follows Microsoft's standard enterprise service terms, with zero-retention options available for qualifying enterprise contracts. Phi-4 is a strong choice for teams building STEM tutoring, math reasoning, or code explanation tools on a tight budget, where the 16K context limit is manageable and per-query cost needs to be below $0.001. Researchers who need an MIT-licensed 14B model for fine-tuning experiments or academic benchmarking will find the weight access and permissive license straightforward. Privacy-sensitive deployments that cannot send data to external APIs benefit from the ability to run the full model locally on 12 GB of VRAM. Teams should not choose the base Phi-4 if they need function calling, long-context processing above 16K tokens, multimodal input, or a model with more recent knowledge than June 2024; GPT-4o Mini (function calling, 128K context) or Phi-4-mini-instruct (128K context, function calling) are more appropriate depending on whether managed or open-source is preferred. Phi-4's training data has a knowledge cutoff of June 2024, with model training completed between October and November 2024. Unlike many small language models that rely on large-scale noisy web crawls, Microsoft built Phi-4's dataset from high-quality synthetic data generated by stronger models (a textbook-quality approach pioneered in the Phi-1 line), filtered public web pages emphasizing educational and reasoning-dense content, and licensed academic books. This data strategy is credited with the model's ability to reach 84.8% MMLU at just 14B parameters, outperforming the 84.0% achieved by GPT-4 at launch, and beating GPT-4o on GPQA Diamond. The MIT license applies to model weights; Microsoft's Azure service terms govern data handling on managed API endpoints, and self-hosted deployments are entirely within the user's own data environment. Since the initial December 2024 release, the Phi-4 family has expanded rapidly. Phi-4-mini-instruct (3.8B) launched in February 2025 with a 128K context window and native function calling, addressing the two biggest gaps in the original 14B model. Phi-4-multimodal-instruct (5.6B) followed in March 2025, adding vision, audio, and function calling through a Mixture-of-LoRAs architecture. Phi-4-reasoning (14B) arrived in April 2025, adding extended chain-of-thought reasoning that improved performance on AIME 2025 and GPQA Diamond versus the base instruct version. Phi-4-reasoning-vision (15B) was released March 2026, extending reasoning capability to visual inputs for chart and document interpretation. The base Phi-4 14B model has not received architecture updates since its December 2024 launch.
Pricing
$0.065 per 1M input tokens, $0.140 per 1M output tokens via Azure AI Foundry. Self-hosting via Ollama or vLLM is free beyond GPU electricity; weights are MIT licensed. DeepInfra offers similar pricing with faster throughput at 54.8 t/s.
Key Features
- Outperforms GPT-4o on Reasoning Benchmarks: Phi-4 scores 84.8% MMLU and 56.1% GPQA Diamond, beating GPT-4o on both benchmarks despite having fewer than a quarter of the estimated parameters.
- MIT License for Full Commercial Use: Released under the MIT license, Phi-4 weights can be downloaded, fine-tuned, and deployed commercially with no royalties, usage restrictions, or attribution requirements beyond the license notice.
- Local Deployment on Consumer GPUs: The full 14B model runs locally with Q4_K_M quantization in 8.30 GB of VRAM, fitting on a single RTX 3080 or RTX 4080 without cloud API costs.
- Cost-Effective Managed API: Azure AI Foundry pricing of $0.065 per 1M input and $0.140 per 1M output tokens is among the lowest rates for a capable instruct-tuned language model on a managed endpoint.
- Synthetic Data Training Pipeline: Microsoft trained Phi-4 primarily on high-quality synthetic data generated by stronger models, which drives its outsized academic reasoning performance relative to its 14B parameter count.
Pros
- Scores 84.8% MMLU and 56.1% GPQA Diamond, outperforming GPT-4o on both benchmarks at just 14B parameters.
- MIT license allows unrestricted commercial fine-tuning, redistribution, and self-hosting with no royalty fees.
- Runs locally on 8.30 GB VRAM with Q4_K_M quantization, fitting on consumer-grade GPUs like the RTX 3080.
- Azure API pricing of $0.065 per 1M input tokens is among the lowest for a capable instruct model on a managed endpoint.
Cons
- Context window of 16,384 tokens is severely limited compared to Phi-4-mini (128K) and GPT-4o Mini (128K).
- No native function calling or structured JSON output mode in the base model.
- Artificial Analysis Intelligence Index score of 10 is below the peer average of 12, indicating weaker general-purpose breadth outside core reasoning tasks.
- Training data cutoff of June 2024 means the model has no knowledge of events from H2 2024 onward.
Benchmarks
- math: 80.4
- mmlu: 84.8
- humaneval: 82.6
- gpqa diamond: 56.1
- artificial analysis intelligence index: 10
- artificial analysis price blended per m: 0.07
- artificial analysis speed tokens per sec: 26.2
Frequently Asked Questions
What is Phi-4 and who built it?
Phi-4 is a 14-billion parameter small language model developed by Microsoft Research, released December 12, 2024 via Azure AI Foundry and made publicly available on Hugging Face on January 9, 2025 under the MIT license. It uses a dense Transformer architecture trained on a curated mix of synthetic data, filtered public web text, and licensed academic sources, with no mixture-of-experts routing. The model targets graduate-level reasoning, STEM problem solving, and code generation, designed to compete with models three to five times its size. Phi-4 scores 84.8% on MMLU, 56.1% on GPQA Diamond, and 82.6% on HumanEval, outperforming GPT-4o on both GPQA and MATH despite having approximately 90 billion fewer parameters. It sits at the top of the original Phi-4 family, which also includes Phi-4-mini (3.8B), Phi-4-multimodal (5.6B), and Phi-4-reasoning (14B, April 2025). The MIT license makes Phi-4 one of the few high-performing 14B models that can be downloaded, fine-tuned, and deployed commercially at no cost.
How much does Phi-4 cost per 1M tokens in 2025?
Via Azure AI Foundry, Phi-4 is priced at $0.065 per 1 million input tokens and $0.140 per 1 million output tokens, with no batch API discount published at time of writing. On OpenRouter and DeepInfra, pricing is comparable or slightly lower, with DeepInfra offering faster throughput at 54.8 tokens per second. For a workload generating 500,000 output tokens per day (roughly 700 typical assistant responses at 700 tokens each), daily cost via Azure is about $0.07. For a daily volume of 5 million output tokens, the daily cost is approximately $0.70, making it suitable for high-frequency low-latency use cases on a tight budget. Self-hosting via Ollama or vLLM on your own hardware brings API costs to zero, with only GPU electricity as ongoing cost since the MIT license permits commercial self-hosting. GitHub Models also provides free access to Phi-4 for individual development with rate limits.
What is Phi-4's context window and max output?
Phi-4 has a context window of 16,384 tokens, which covers approximately 12,000 words or 25 to 30 pages of standard text. The maximum output length is also 16,384 tokens, meaning the combined input plus output cannot exceed 16,384 tokens in a single request. This is a significant limitation compared to GPT-4o Mini (128K context), Phi-4-mini-instruct (128K context), and Gemini 2.0 Flash (1 million tokens). For single-turn tasks like solving a math problem, summarizing a short article, or explaining a code snippet, the 16K window is adequate. For multi-turn conversations with long histories, RAG pipelines processing full research papers, or document classification at scale, the 16K limit will routinely cause truncation. There is no extended-context tier or sliding window option for the base Phi-4 model; teams needing more context should evaluate Phi-4-mini-instruct, which offers 128K tokens at 3.8B parameters.
How does Phi-4 compare to GPT-4o on benchmarks?
Phi-4 outperforms GPT-4o on GPQA Diamond (56.1% vs 53.6%) and MATH (80.4% vs 76.6%) according to Microsoft's December 2024 technical report, a notable result for a 14B model against a leading frontier model at that time. On MMLU, Phi-4 scores 84.8% versus GPT-4o's 85.7%, a gap of under 1 percentage point. On HumanEval, Phi-4 scores 82.6% versus GPT-4o's 90.2%, a 7.6-point gap that matters for production coding tasks. However, GPT-4o has a 128K context window, native multimodal input (vision, audio, video), built-in function calling, and a much higher overall capability ceiling for complex agentic tasks. On Artificial Analysis's Intelligence Index, Phi-4 scores 10 versus much higher scores for current frontier models, reflecting that the benchmark wins are narrow and task-specific. For pure math and STEM reasoning at low cost, Phi-4 holds up; for production applications requiring long context, tool use, or multimodal input, GPT-4o or GPT-4o Mini offer a more complete feature set.
Is Phi-4 open source?
Phi-4 is released under the MIT license, one of the most permissive open-source software licenses, allowing downloading, using, modifying, and redistributing the model commercially with no royalty requirements or usage restrictions beyond the license notice. The weights are available on Hugging Face at microsoft/phi-4 in SafeTensors format for direct loading with the transformers library. GGUF-quantized versions are available from the bartowski repository on Hugging Face, supporting Ollama, LM Studio, and llama.cpp for local inference without any cloud API dependency. VRAM requirements for local hosting are approximately 8.30 GB for Q4_K_M quantization, 10 GB for Q5_K_M, 14+ GB for Q8_0 near-lossless format, and approximately 28 GB for FP16 full precision. A minimum of 12 GB VRAM is recommended to run the model at useful speed on consumer hardware. No restrictions apply to commercial or research use, and the MIT license permits creating and distributing fine-tuned derivatives.
What modalities does Phi-4 support?
Phi-4 (the 14B base instruct model) accepts text input only and generates text output only; it does not support images, audio, video, or PDF uploads in the base form. Function calling is not natively supported in base Phi-4; Microsoft added function calling support in Phi-4-mini-instruct and Phi-4-multimodal-instruct, not in the 14B base. There is no native JSON mode API parameter; structured output must be prompted manually using explicit format instructions. Phi-4-multimodal-instruct is a separate 5.6B model from March 2025 that processes text, images, and audio simultaneously using a Mixture-of-LoRAs technique, with function calling support, making it the family member for multimodal applications. Phi-4-reasoning, released April 2025, retains text-only input but adds extended chain-of-thought reasoning capability for math and science problems. For any multimodal application, use Phi-4-multimodal-instruct or a different model, not the base 14B Phi-4.
Does Phi-4 train on user data?
Phi-4 does not train on user API inputs by default; inputs sent to Azure AI Foundry are not used for model training according to Microsoft's standard API data terms. Azure AI Foundry retains inputs for up to 30 days for abuse monitoring under standard enterprise service terms, unless a zero-retention agreement is separately negotiated. For self-hosted deployments via Hugging Face or Ollama, no data ever leaves the user's own infrastructure, making self-hosting the strongest privacy option. Azure AI Foundry falls under Microsoft's enterprise compliance umbrella, which includes SOC 2 Type II certification and ISO 27001 certification. The platform is GDPR compliant and HIPAA-eligible for healthcare deployments when a Microsoft Business Associate Agreement (BAA) is in place. Microsoft's Responsible AI principles and Phi-4's model card document the data handling approach; the MIT license on the weights governs usage rights, while Azure service terms govern data handling on managed endpoints.
Who is Phi-4 best for and who should avoid it?
Phi-4 is best for teams that need high reasoning accuracy on STEM, math, and coding tasks at the lowest possible API cost per token, where the 16K context window is not a constraint. Researchers who want to fine-tune an MIT-licensed 14B model without license restrictions, or deploy locally on consumer hardware with 12 GB VRAM, will find Phi-4 practical and cost-free to host. Students and educators building STEM tutoring applications benefit from the 84.8% MMLU score at $0.065 per 1M input tokens, which is significantly cheaper than proprietary models of similar capability. Teams should avoid Phi-4 if their application requires multi-turn conversations with long histories, document analysis of full research papers, native function calling or structured output, or multimodal input handling, as the base model supports none of these. For long-context reasoning within the Phi family, Phi-4-mini-instruct (128K context) is a better starting point. For function calling and tool use, GPT-4o Mini or Phi-4-mini-instruct offer fully featured alternatives at comparable pricing.