Name: Phi-4: Microsoft's 14B Open-Source SLM, 84.8% MMLU (2025)
Brand: Microsoft
Price: 0.07 USD
Availability: InStock

Question 1

What is Phi-4 and who built it?

Accepted Answer

Phi-4 is a 14-billion-parameter small language model developed by Microsoft Research. It was released on December 12, 2024 via Azure AI Foundry and made publicly available on Hugging Face on January 9, 2025 under the MIT license. It uses a dense Transformer architecture, with no mixture-of-experts routing, trained on a curated mix of synthetic data, filtered public web text, and licensed academic sources. The model targets graduate-level reasoning, STEM problem solving, and code generation, and is designed to compete with models three to five times its size. Phi-4 scores 84.8% on MMLU, 56.1% on GPQA Diamond, and 82.6% on HumanEval, and it outperforms GPT-4o on both GPQA and MATH despite having far fewer parameters; the gap of approximately 90 billion parameters is an estimate, since OpenAI has not published GPT-4o's size. It sits at the top of the original Phi-4 family, which also includes Phi-4-mini (3.8B), Phi-4-multimodal (5.6B), and Phi-4-reasoning (14B, April 2025). The MIT license makes Phi-4 one of the few high-performing 14B models that can be downloaded, fine-tuned, and deployed commercially at no cost.

Question 2

How much does Phi-4 cost per 1M tokens in 2025?

Accepted Answer

Via Azure AI Foundry, Phi-4 is priced at $0.065 per 1 million input tokens and $0.140 per 1 million output tokens, with no batch API discount published at time of writing. On OpenRouter and DeepInfra, pricing is comparable or slightly lower, with DeepInfra offering faster throughput at 54.8 tokens per second. For a workload generating 500,000 output tokens per day (roughly 700 typical assistant responses at 700 tokens each), the output cost via Azure is about $0.07 per day (500,000 × $0.140 / 1,000,000), before input tokens are counted. At 5 million output tokens per day, the output cost is approximately $0.70, which makes Phi-4 suitable for high-frequency, low-latency use cases on a tight budget. Self-hosting via Ollama or vLLM on your own hardware brings API costs to zero, leaving the GPU hardware and its electricity as the only ongoing costs; the MIT license permits commercial self-hosting. GitHub Models also provides free, rate-limited access to Phi-4 for individual development.
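
The arithmetic above can be reproduced with a small estimator. This is a minimal sketch that assumes the Azure AI Foundry rates quoted in this answer still apply; the function and constant names are illustrative and not part of any Azure SDK.

```python
# Rates quoted above for Azure AI Foundry, in USD per 1 million tokens.
# They are assumptions taken from this page and may change.
INPUT_USD_PER_MILLION = 0.065
OUTPUT_USD_PER_MILLION = 0.140


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for the given token counts."""
    input_cost = input_tokens / 1_000_000 * INPUT_USD_PER_MILLION
    output_cost = output_tokens / 1_000_000 * OUTPUT_USD_PER_MILLION
    return input_cost + output_cost


if __name__ == "__main__":
    # Output-only figures from the answer above.
    print(f"{estimate_cost_usd(0, 500_000):.2f}")    # 500K output tokens per day
    print(f"{estimate_cost_usd(0, 5_000_000):.2f}")  # 5M output tokens per day
```

Passing a non-zero `input_tokens` value gives a fuller picture, since most workloads send more input tokens than they receive as output.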

Question 3

What is Phi-4's context window and max output?

Accepted Answer

Phi-4 has a context window of 16,384 tokens, which covers approximately 12,000 words or 25 to 30 pages of standard text. Output tokens count against the same window, so the combined input plus output cannot exceed 16,384 tokens in a single request; the maximum output of 16,384 tokens is only reachable with an empty prompt. This is a significant limitation compared to GPT-4o Mini (128K context), Phi-4-mini-instruct (128K context), and Gemini 2.0 Flash (1 million tokens). For single-turn tasks like solving a math problem, summarizing a short article, or explaining a code snippet, the 16K window is adequate. For multi-turn conversations with long histories, RAG pipelines processing full research papers, or classification of long documents, the 16K limit will routinely cause truncation. There is no extended-context tier or sliding window option for the base Phi-4 model; teams needing more context should evaluate Phi-4-mini-instruct, which offers 128K tokens at 3.8B parameters.
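
Because input and output share one window, it helps to check the budget before sending a request. The sketch below assumes the prompt's token count has already been measured with the model's tokenizer; the helper names are hypothetical.

```python
# Shared window for input plus output, as stated in the answer above.
CONTEXT_WINDOW_TOKENS = 16_384


def max_output_budget(prompt_tokens: int,
                      context_window: int = CONTEXT_WINDOW_TOKENS) -> int:
    """Return how many output tokens remain after the prompt is counted."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    return max(context_window - prompt_tokens, 0)


def fits_in_context(prompt_tokens: int, requested_output_tokens: int,
                    context_window: int = CONTEXT_WINDOW_TOKENS) -> bool:
    """Return True if the prompt plus the requested output fits the window."""
    return prompt_tokens + requested_output_tokens <= context_window


if __name__ == "__main__":
    print(max_output_budget(12_000))       # tokens left for the reply
    print(fits_in_context(16_000, 1_000))  # False: the request would overflow
```

A caller that gets `False` can either shorten the prompt (for example by dropping older conversation turns) or lower the requested output length.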

Question 4

How does Phi-4 compare to GPT-4o on benchmarks?

Accepted Answer

Phi-4 outperforms GPT-4o on GPQA Diamond (56.1% vs 53.6%) and MATH (80.4% vs 76.6%) according to Microsoft's December 2024 technical report, a notable result for a 14B model against a leading frontier model at that time. On MMLU, Phi-4 scores 84.8% versus GPT-4o's 85.7%, a gap of under 1 percentage point. On HumanEval, Phi-4 scores 82.6% versus GPT-4o's 90.2%, a 7.6-point gap that matters for production coding tasks. However, GPT-4o has a 128K context window, native multimodal input (vision, audio, video), built-in function calling, and a much higher overall capability ceiling for complex agentic tasks. On Artificial Analysis's Intelligence Index, Phi-4 scores 10 versus much higher scores for current frontier models, reflecting that the benchmark wins are narrow and task-specific. For pure math and STEM reasoning at low cost, Phi-4 holds up; for production applications requiring long context, tool use, or multimodal input, GPT-4o or GPT-4o Mini offer a more complete feature set.

Question 5

Is Phi-4 open source?

Accepted Answer

Phi-4 is released under the MIT license, one of the most permissive open-source software licenses. It allows downloading, using, modifying, and redistributing the model commercially with no royalty requirements or usage restrictions beyond preserving the license notice. The weights are available on Hugging Face at microsoft/phi-4 in SafeTensors format for direct loading with the transformers library. GGUF-quantized versions are available from the bartowski repository on Hugging Face, supporting Ollama, LM Studio, and llama.cpp for local inference without any cloud API dependency. VRAM requirements for local hosting are approximately 8.3 GB for Q4_K_M quantization, 10 GB for Q5_K_M, at least 14 GB for the near-lossless Q8_0 format, and approximately 28 GB for FP16 full precision. A minimum of 12 GB VRAM is recommended to run the model at useful speed on consumer hardware. No restrictions apply to commercial or research use, and the MIT license permits creating and distributing fine-tuned derivatives.
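
The VRAM figures above can be turned into a quick lookup for choosing a quantization format. This is a rough sketch: the table copies the approximate numbers from this answer (treating "at least 14 GB" as 14), it ignores KV-cache and runtime overhead, and the function names are illustrative.

```python
from typing import Optional

# Approximate VRAM needs in GB, copied from the figures quoted above.
VRAM_GB_BY_FORMAT = {
    "Q4_K_M": 8.3,
    "Q5_K_M": 10.0,
    "Q8_0": 14.0,
    "FP16": 28.0,
}


def fp16_weight_gb(params_billions: float) -> float:
    """Estimate FP16 weight size: 2 bytes per parameter, in decimal GB."""
    return params_billions * 2.0


def best_format_for(vram_gb: float) -> Optional[str]:
    """Return the highest-precision format that fits, or None if none fit."""
    fitting = [(need, name) for name, need in VRAM_GB_BY_FORMAT.items()
               if need <= vram_gb]
    return max(fitting)[1] if fitting else None


if __name__ == "__main__":
    print(fp16_weight_gb(14))   # 14B parameters at 2 bytes each
    print(best_format_for(12))  # a 12 GB consumer GPU
```

The 28 GB FP16 figure follows directly from 14 billion parameters at 2 bytes each; real deployments need headroom beyond the weights, so treat a tight fit as a warning rather than a guarantee.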

Question 6

What modalities does Phi-4 support?

Accepted Answer

Phi-4 (the 14B base instruct model) accepts text input only and generates text output only; it does not support images, audio, video, or PDF uploads in the base form. Function calling is not natively supported in base Phi-4; Microsoft added function calling support in Phi-4-mini-instruct and Phi-4-multimodal-instruct, not in the 14B base. There is no native JSON mode API parameter; structured output must be prompted manually using explicit format instructions. Phi-4-multimodal-instruct is a separate 5.6B model from March 2025 that processes text, images, and audio simultaneously using a Mixture-of-LoRAs technique, with function calling support, making it the family member for multimodal applications. Phi-4-reasoning, released April 2025, retains text-only input but adds extended chain-of-thought reasoning capability for math and science problems. For any multimodal application, use Phi-4-multimodal-instruct or a different model, not the base 14B Phi-4.
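
Since there is no JSON mode parameter, structured output has to be requested in the prompt and validated on the way back. The sketch below shows one way to do that; the two-key schema, the prompt wording, and the helper names are hypothetical, and the call to the model itself is left out.

```python
import json

# Hypothetical schema used only for this illustration.
REQUIRED_KEYS = ("answer", "confidence")


def build_prompt(question: str) -> str:
    """Wrap a question with explicit JSON format instructions."""
    return (
        "Answer the question below.\n"
        "Respond with a single JSON object and nothing else, using exactly "
        'these keys: "answer" (string) and "confidence" (number from 0 to 1).'
        "\n\n"
        f"Question: {question}"
    )


def parse_reply(reply: str) -> dict:
    """Extract the JSON object from a reply and check its required keys."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    # json.JSONDecodeError is a subclass of ValueError.
    data = json.loads(reply[start:end + 1])
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data


if __name__ == "__main__":
    # A made-up reply with extra text around the JSON object.
    sample = 'Sure! {"answer": "4", "confidence": 0.9}'
    print(parse_reply(sample))
```

Slicing from the first `{` to the last `}` tolerates a model that adds a sentence before or after the object; a caller would typically retry the request when `parse_reply` raises `ValueError`.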

Question 7

Does Phi-4 train on user data?

Accepted Answer

Inputs sent to Phi-4 through Azure AI Foundry are not used for model training by default, according to Microsoft's standard API data terms. Azure AI Foundry retains inputs for up to 30 days for abuse monitoring under standard enterprise service terms, unless a zero-retention agreement is separately negotiated. For self-hosted deployments via Hugging Face or Ollama, no data leaves the user's own infrastructure, making self-hosting the strongest privacy option. Azure AI Foundry falls under Microsoft's enterprise compliance umbrella, which includes SOC 2 Type II and ISO 27001 certification. The platform is GDPR compliant and HIPAA-eligible for healthcare deployments when a Microsoft Business Associate Agreement (BAA) is in place. Microsoft's Responsible AI principles and Phi-4's model card document the data handling approach; the MIT license on the weights governs usage rights, while Azure service terms govern data handling on managed endpoints.

Question 8

Who is Phi-4 best for and who should avoid it?

Accepted Answer

Phi-4 is best for teams that need high reasoning accuracy on STEM, math, and coding tasks at the lowest possible API cost per token, where the 16K context window is not a constraint. Researchers who want to fine-tune an MIT-licensed 14B model without license restrictions, or deploy locally on consumer hardware with 12 GB VRAM, will find Phi-4 practical and cost-free to host. Students and educators building STEM tutoring applications benefit from the 84.8% MMLU score at $0.065 per 1M input tokens, which is significantly cheaper than proprietary models of similar capability. Teams should avoid Phi-4 if their application requires multi-turn conversations with long histories, document analysis of full research papers, native function calling or structured output, or multimodal input handling, as the base model supports none of these. For long-context reasoning within the Phi family, Phi-4-mini-instruct (128K context) is a better starting point. For function calling and tool use, GPT-4o Mini or Phi-4-mini-instruct offer fully featured alternatives at comparable pricing.
