DeepSeek V4 Flash: 284B MoE with 1M Context & 79% SWE-bench (May 2026)
DeepSeek V4 Flash: MIT-licensed open-source MoE with 284B/13B params, 1M context, 79% SWE-bench. $0.14/M input tokens. 10-50x cheaper than Claude. Benchmark scores, pricing, self-hosting guide, and API endpoints.
DeepSeek-V4 Flash: 284B-param open-source MoE released April 2026. Scores 79% SWE-bench, 88.1% GPQA. Costs $0.14/M input (10-50x cheaper than Claude/GPT). 1M context, MIT license, self-hostable on H100 or quantized to RTX 4090.
DeepSeek-V4 Flash is a 284B-parameter Mixture-of-Experts model released April 24, 2026, with 13B activated params. It scores 79% on SWE-bench Verified and 88.1% on GPQA Diamond. Pricing: $0.14 per 1M input tokens, $0.28 per 1M output tokens with 98% prompt caching discount. Native 1M context window. MIT licensed, open-source on Hugging Face.
Provider: DeepSeek · Family: DeepSeek-V4
Context window: 1,048,576 tokens · Max output: 384,000
Input modalities: text, image, tool-calls · Output: text, tool-calls
About DeepSeek-V4 Flash
DeepSeek-V4 Flash is an open-source Mixture-of-Experts language model released by DeepSeek on April 24, 2026, as part of the DeepSeek-V4 preview. The model features 284 billion total parameters with only 13 billion activated per token, making it dramatically more efficient than dense equivalents while maintaining frontier-class reasoning and coding capabilities. It is positioned as the faster, cost-optimized sibling to DeepSeek-V4 Pro (1.6T/49B), both licensed under MIT for commercial deployment and self-hosting without vendor permission. The V4 series represents a generational leap in DeepSeek's architecture, introducing Hybrid Attention combining Compressed Sparse Attention and Heavily Compressed Attention, reducing per-token FLOPs to roughly 10% of V3.2 at 1M-token context lengths. V4 Flash achieves strong benchmark performance across the frontier evaluation suite. On SWE-bench Verified, it scores 79%, placing it in the top tier for agentic coding alongside Claude Opus 4.7 (93.9%) and GPT-5.5 (94.6%)—a narrow 14-point gap from the best-in-class. On GPQA Diamond, a rigorous test of graduate-level reasoning requiring multi-step logic and domain knowledge, V4 Flash scores 88.1%, again competitive with closed-frontier models. HumanEval coding benchmark puts V4 Flash near 96%, matching or exceeding most prior-generation proprietary models. The Artificial Analysis Intelligence Index composite score is 47, placing V4 Flash above DeepSeek-V3.2 and comparable to Claude Sonnet 4.6 in extended-reasoning mode, though behind V4 Pro's score of 52 and the absolute frontier (GPT-5.5, Gemini 3.1 Pro). On Chatbot Arena (LM Arena), V4-Pro ranks #20 globally and #2 among open models on the text leaderboard; V4-Flash's preliminary ranking sits around #47-#55, reflecting its slightly lower capability ceiling on general-purpose tasks versus the Pro variant. The 1M-token context window is production-ready at both Flash and Pro tiers, with internal needle-in-haystack testing demonstrating 97-99% retrieval accuracy for facts buried at varying depths. Unlike many competitors, there is no separate paid tier or gating for the full 1M-token capacity—both are included in the base API pricing. The model supports a 384K-token maximum output, enabling long-form document generation, comprehensive code refactoring across multiple files, and extended chain-of-thought reasoning. Both thinking (reasoning Max, High, or standard Non-thinking) and non-thinking modes are exposed via the same deepseek-v4-flash model ID, toggled at inference time with a single parameter. Pricing is the killer feature: V4 Flash costs $0.14 per million input tokens and $0.28 per million output tokens on the native DeepSeek API, with prompt caching reducing cache-hit input to $0.0028 per million (a 98% discount). This is roughly 10-20 times cheaper than Claude Opus 4.6 ($15 input, $25 output) and 50-100 times cheaper than GPT-5.5 ($3 input, $30 output) on equivalent inputs. For a typical 10K-context request with 2K output, a single inference costs $1.42 on Claude Opus, $300+ on GPT-5.5, but only $0.20 on V4 Flash—a difference that compounds at scale. The broader DeepSeek-V4 pricing strategy has triggered what industry observers describe as a cost-performance reset: proprietary models can no longer compete on price, only on marginal capability gains in narrow benchmarks. The open-weights release on Hugging Face (huggingface.co/deepseek-ai/DeepSeek-V4-Flash) enables self-hosting, fine-tuning, and research access without vendor lock-in. Official weights ship in FP4+FP8 mixed precision, keeping the total download size to roughly 160GB for full precision and allowing aggressive quantization down to Q4_K_M GGUF (roughly 60GB) with minimal quality loss. A single H100 80GB GPU at FP8, or two H100s in distributed inference, can run V4 Flash for real-time deployments; community quantizations (Unsloth, others) have pushed it onto single RTX 4090s at Q4 quantization. For enterprises, this means forking from proprietary APIs is operationally viable: train a LoRA adapter on your data, deploy on your infrastructure, and never expose prompts or outputs to a third party. Archiecture wise, the Hybrid Attention mechanism is the standout innovation. Compressed Sparse Attention applies a 4x compression ratio to the KV cache along the sequence dimension, then uses sparse attention (top-1024 relevant tokens per query) plus a 128-token local sliding window to balance retrieval quality with compute. Heavily Compressed Attention applies 128x compression for a second attention head, enabling very long-range context synthesis. At 1M tokens, V4 Flash requires only 7% of the KV cache of V3.2 and 10% of the per-token FLOPs—a remarkable feat that makes million-token contexts economically viable for real workloads. Manifold-Constrained Hyper-Connections strengthen residual connections via Sinkhorn-Knopp orthogonalization, improving gradient flow and training stability. The MoE router uses DeepSeekMoE with Sqrt(Softplus) affinity scoring and hash-routed experts for the first few layers, striking a balance between capacity and routing efficiency. Multimodal support is not at launch: V4 Flash is text-only in April 2026. However, DeepSeek V4-Vision (a separate model family) offers native image understanding at vastly lower cost than GPT-4o or Claude 3.5 Sonnet, with roughly 90 image tokens per image versus 870 for competitors. Audio and video are announced for future releases. Tool use and function calling are fully supported, with both OpenAI-compatible and Anthropic-compatible API formats available, enabling agentic workflows with reliable structured output. Safety and alignment details are not extensively published in early reviews, but DeepSeek's approach combines SFT (supervised fine-tuning) with GRPO (group relative policy optimization) in a two-stage pipeline: domain experts develop domain-specific alignments independently, then models are unified via on-policy distillation. The post-training regime is less stringent than Claude's Constitutional AI or GPT's RLHF-heavy approach, resulting in a model that is more permissive on edge cases and slightly more prone to less-than-ideal refusals. Red-teaming and jailbreak resistance details are not yet public; the model is in preview status, and safety posture may evolve. Deprecation timeline: DeepSeek's legacy API model names deepseek-chat and deepseek-reasoner are routing to V4-Flash non-thinking and thinking modes during a transition period, but both are fully deprecated and inaccessible after July 24, 2026, 15:59 UTC. Any production workload must migrate to deepseek-v4-flash or deepseek-v4-pro before that date. The company is also aggressively pushing the V4 line across third-party providers: Azure AI Foundry, Google Vertex AI, AWS Bedrock, Fireworks, Together AI, and others all offer V4-Flash within weeks of the public launch, reducing single-vendor risk for enterprises. Who should use V4 Flash: startups and teams with cost-sensitive long-context use cases (research analysis, document processing, contract review), open-source-first AI builders needing reproducible deployments, and enterprises requiring no data residency in the US (can self-host or use local cloud regions). Who should avoid it: teams requiring state-of-the-art agentic coding (Claude Opus 4.7 leads on SWE-bench by 14 points), applications needing visual input at launch (use DeepSeek V4-Vision or GPT-4o), voice-first products (no audio I/O), and organizations with strict safety/alignment standards (V4's post-training is lighter than frontier models). For math and multilingual reasoning, V4 Flash is competitive; for pure coding autonomy, V4-Pro or Claude Opus 4.7 remain superior but cost 10-50x more per token.
Pricing
Native DeepSeek API pricing as of April 24, 2026. Prompt caching reduces cached input to $0.0028/1M (98% discount). Batch API available for 20-40% additional discount pending confirmation. Third-party providers (OpenRouter, Azure, Vertex, Bedrock, Fireworks) may have slightly different rates.
Key Features
- 1M Token Context Window: Handles million-token inputs with 99% retrieval accuracy verified on internal needle-in-haystack. Hybrid Attention (CSA + HCA) reduces FLOPs to 10% of V3.2.
- Reasoning Modes (Max/High/Standard): Toggle reasoning effort per request. Max mode adds visible chain-of-thought for STEM tasks; standard mode is fast and cost-efficient.
- Mixture of Experts Efficiency: 284B total / 13B active parameters. Inference cost scales with active params, not total params. Comparable to 70B dense model but with 4x the knowledge capacity.
- Prompt Caching: Cached inputs cost $0.0028/1M (98% discount) with 24h TTL. Critical for batch analysis, document retrieval loops, and repeated system prompts.
- Tool Use & Function Calling: Native support for OpenAI-compatible and Anthropic-compatible tool schemas. Reliable structured output via function calling.
Pros
- Pricing is transformative: $0.14/M input is 10-50x cheaper than Claude/GPT, enabling cost-sensitive production at scale.
- 1M context with proven long-range retrieval and efficient KV cache (10% of V3.2); economical for long-document tasks.
- Open-source MIT license with weights on Hugging Face; self-host without permission, fine-tune, or deploy air-gapped.
Cons
- Text-only at launch; no native image, audio, or video input (use separate DeepSeek V4-Vision or competitors).
- 79% SWE-bench vs Claude 93.9%; not the best for multi-file agentic coding; 14-point gap on complex tasks.
- Lighter safety alignment than Claude/GPT; more permissive on edge cases and potentially unsafe requests.
Benchmarks
- humaneval: 96.4
- live bench: 91.6
- lmarena elo: 1300
- gpqa diamond: 88.1
- lmarena rank: 50
- swe bench verified: 79
- artificial analysis intelligence index: 47
- artificial analysis price blended per m: 0.21
- artificial analysis speed tokens per sec: 93.9
Frequently Asked Questions
What is DeepSeek-V4 Flash and who built it?
DeepSeek-V4 Flash is an open-source Mixture-of-Experts language model released by DeepSeek on April 24, 2026, as part of the V4 preview series. It features 284 billion total parameters with only 13 billion activated per token, making it a cost-efficient alternative to dense models. The architecture uses Hybrid Attention combining Compressed Sparse Attention and Heavily Compressed Attention, reducing inference costs to 10% of DeepSeek-V3.2 at 1M-token context. Licensed under MIT, it is the first tier of DeepSeek's two-variant lineup alongside V4-Pro (1.6T/49B). V4-Flash scores 79% on SWE-bench Verified (agentic coding), 88.1% on GPQA Diamond (reasoning), and 96.4% on HumanEval (code generation) - placing it in the top tier for open-source models while remaining in preview status.
How much does DeepSeek-V4 Flash cost per 1M tokens?
DeepSeek-V4 Flash costs $0.14 per million input tokens and $0.28 per million output tokens on the native DeepSeek API as of April 24, 2026. This is the base pricing. Prompt caching reduces cached input tokens to $0.0028 per million (98% discount) with a 24-hour TTL. For comparison: Claude Opus 4.6 costs $15 input and $25 output per 1M tokens; GPT-5.5 costs $3-5 input and $30 output. A typical 10K-context request with 2K output costs $1.42 on Claude but only $0.20 on V4-Flash - roughly 7x cheaper. A free tier provides 5M tokens per signup (valid 30 days) without requiring a credit card. Third-party providers like OpenRouter, Azure, Vertex, and Bedrock may have slightly different rates but are all in the $0.10-0.16/M range.
What is DeepSeek-V4 Flash's context window and maximum output?
DeepSeek-V4 Flash supports a native 1,048,576-token (1M token) context window with a 384,000-token maximum output limit. Internal needle-in-haystack testing demonstrates 97-99% retrieval accuracy for facts buried at varying depths within the 1M context. Unlike some competitors, there is no separate paid tier or gating - the full 1M context is included in the base API pricing. The efficient Hybrid Attention architecture (Compressed Sparse Attention + Heavily Compressed Attention) achieves this capability while reducing KV cache and per-token FLOPs to only 10% of DeepSeek-V3.2, making million-token contexts economically viable for real workloads. Both Flash and Pro variants ship with the same 1M context window, confirming DeepSeek did not degrade long-context to achieve cost efficiency.
How does DeepSeek-V4 Flash compare on benchmarks vs Claude Opus and GPT-5.5?
On SWE-bench Verified (agentic coding), DeepSeek V4-Flash scores 79%, Claude Opus 4.7 scores 93.9%, and GPT-5.5 scores 94.6% - a 14-point gap between Flash and Claude. On GPQA Diamond (graduate-level reasoning), V4-Flash scores 88.1%, ahead of most prior open-weight models and competitive with closed frontier models. HumanEval (code generation) puts V4-Flash at approximately 96.4%, matching or exceeding prior-generation models like Claude Sonnet. The Artificial Analysis Intelligence Index composite score is 47 for V4-Flash, placing it above V3.2 (42) and comparable to Claude Sonnet 4.6 extended-reasoning but behind V4-Pro (52) and absolute frontier models (GPT-5.5, Gemini 3.1 Pro at 60+). On Chatbot Arena (LM Arena), V4-Pro ranks 20 globally and 2 among open models; V4-Flash's preliminary ranking is #47-55. Bottom line: V4-Flash is not the best for complex multi-file coding (14 points behind Claude), but competitive for reasoning and cost-prohibitive for budget teams on any other model.
Is DeepSeek-V4 Flash open source or proprietary?
DeepSeek-V4 Flash is fully open-source and open-weights under the MIT License, one of the most permissive licenses available. Model weights are publicly available on Hugging Face at huggingface.co/deepseek-ai/DeepSeek-V4-Flash (base and instruct variants) as of April 24, 2026. You can download the weights, self-host on your infrastructure (H100 or quantized to RTX 4090), fine-tune for your domain, and deploy commercially without contacting DeepSeek. Official weights ship in FP4+FP8 mixed precision (~160GB download); community quantizations via Unsloth and others have created GGUF versions (Q4_K_M ~60GB) that fit on single consumer GPUs. This stands in contrast to proprietary models like GPT-5.5 and Claude Opus, which are API-only and do not allow self-hosting or modification. Enterprise benefit: forking from proprietary APIs is operationally viable with V4-Flash - train a LoRA, deploy internally, never expose prompts to a third party.
What modalities and capabilities does DeepSeek-V4 Flash support?
DeepSeek-V4 Flash at launch is text-input/text-output only, with native tool use and function calling. Confirmed capabilities: (1) Text input and output; (2) Tool use via both OpenAI-compatible and Anthropic-compatible function calling schemas; (3) Reliable structured output when prompted with JSON schemas. Not at launch: (1) Native image understanding - a separate DeepSeek V4-Vision model handles images at vastly lower cost than GPT-4o or Claude; (2) Audio input or output; (3) Video understanding. Reasoning modes are toggled per-request: standard (fast), high (more reasoning), or max (deepest reasoning). Both thinking and non-thinking modes are exposed via the same model ID, eliminating the need for separate routing logic as with legacy deepseek-chat vs deepseek-reasoner. If you need vision, use the DeepSeek V4-Vision family or GPT-4o. If you need audio, use a separate ASR/TTS service or wait for future multimodal releases.
Does DeepSeek-V4 Flash train on user data?
No, DeepSeek does not train on API inputs by default. API inputs and outputs are retained for 30 days for abuse monitoring and optional content moderation, then deleted unless flagged. There is no ability to opt out of the 30-day retention on the standard tier. Enterprise customers with strict data governance requirements can negotiate zero-retention policies, though explicit HIPAA, SOC2, and GDPR compliance documentation is not published as of the April 2026 preview. Data residency options exist: default is China; via AWS Bedrock you can run V4-Flash in US regions; via Azure and Google Cloud you can run in EU regions. For maximum privacy and compliance, self-hosting the open-source model on your own infrastructure is the most secure option - no data leaves your network, and there are no third-party audit requirements to satisfy.
Who is DeepSeek-V4 Flash best for and who should avoid it?
Best for: (1) Cost-sensitive startups and teams building agentic systems or autonomous research platforms; (2) Open-source-first developers and researchers needing reproducible, self-hosted models; (3) Enterprises with long-context use cases (contract review, document analysis, research synthesis) and tight budgets; (4) Teams avoiding US-based cloud providers or requiring air-gapped deployment. Avoid for: (1) Mission-critical multi-file code refactoring (14-point SWE-bench gap vs Claude Opus 4.7); GPT-5.5 or Claude Opus are better bets for production software engineering. (2) Visual understanding at launch - use DeepSeek V4-Vision or GPT-4o for images. (3) Audio-first or voice products - V4-Flash has no audio I/O. (4) Strict safety and alignment requirements - V4's post-training is lighter than Claude or GPT; it is more permissive on edge cases and potentially allows requests competitors would refuse. For pure cost optimization on long-context tasks, V4-Flash has no peer.