Nemotron 3 Nano 30B A3B: 1M Context & 3.3x Speed (2026)

NVIDIA's open Nemotron 3 Nano 30B-A3B (Dec 2025) hits 78.3% MMLU-Pro with 1M context, 3.3x Qwen3-30B's throughput, and $0.05/$0.20 per 1M token API pricing.

Nemotron 3 Nano 30B-A3B is NVIDIA's open hybrid Mamba-Transformer MoE model (released December 14, 2025) with a 1M-token context window, 78.3% MMLU-Pro, and 89.06% AIME 2025 accuracy. Priced at $0.05/$0.20 per 1M tokens on OpenRouter and DeepInfra, it delivers 3.3x the throughput of Qwen3-30B-A3B on the same hardware.

Nemotron 3 Nano 30B-A3B, released December 14, 2025 by NVIDIA, is a hybrid Mamba-Transformer MoE model with a 1M-token context and a 78.3% MMLU-Pro score. It costs $0.05 input / $0.20 output per 1M tokens on OpenRouter and DeepInfra, and delivers 3.3x the throughput of Qwen3-30B-A3B on identical hardware.

Provider: NVIDIA · Family: Nemotron 3

Context window: 1,000,000 tokens

Input modalities: text, tool-calls · Output: text, tool-calls

About Nemotron 3 Nano 30B A3B

Nemotron 3 Nano 30B-A3B is an open-weight language model built by NVIDIA and released on December 14-15, 2025 as the first entry in the Nemotron 3 family, followed by the larger Super (120B, March 2026) and Ultra (550B, June 2026) tiers. It uses a hybrid architecture that interleaves Mamba-2 state-space layers with grouped-query-attention Transformer layers inside a Mixture-of-Experts framework: 52 total layers, 23 Mamba-2 and 23 MoE layers plus 6 GQA attention layers, with each MoE layer routing to 6 of 128 experts per token through a learned MLP router, alongside shared experts active on every token. The model was designed to make agentic AI, tool-calling loops with long conversation histories, affordable at production scale rather than to chase the largest possible parameter count. On benchmarks from NVIDIA's own technical report (arXiv 2512.20848), Nemotron 3 Nano scores 78.30% on MMLU-Pro, 89.06% on AIME 2025 without tool use (rising to 99.17% with tools enabled), 73.04% on GPQA without tools (75.00% with tools), 68.25% on LiveCodeBench v6, 38.76% on SWE-Bench via the OpenHands harness, and 71.51% on IFBench prompt-following. Against its closest open-weight rival, Alibaba's Qwen3-30B-A3B, Nemotron trails slightly on MMLU-Pro (78.3% vs 80.9%) but claims 3.3x higher throughput on identical single-H200-GPU hardware at 8K input/16K output, and 2.2x higher throughput than GPT-OSS-20B. On the RULER long-context benchmark, Nemotron 3 Nano holds 68.2% accuracy at full context length, ahead of Qwen3-30B's retention at comparable depth. The model has a native 1M-token context window, extended via continued pretraining at a 512K sequence length; the default deployment configuration many providers expose is 262K tokens, with the full 1M window available by adjusting inference server settings. NVIDIA has not published an explicit max output token limit separate from the context budget. The architecture activates only about 3.2B parameters per token (3.6B including embeddings) out of 31.6B total, roughly 10% of total weights, which is the core lever behind its throughput advantage. Nemotron 3 Nano is text-in, text-out with native tool-calling and structured function output support; it does not accept image, audio, or video input (a separate Omni variant handles multimodal input). It ships with a Reasoning ON/OFF toggle and a configurable thinking-token budget, letting developers trade accuracy for lower inference cost per request. It supports 15+ languages in its post-training data mix. As an open-weight model, pricing depends entirely on deployment path. Third-party managed APIs, DeepInfra and OpenRouter both list $0.05 per 1M input tokens and $0.20 per 1M output tokens, among the cheapest rates in its capability class. Self-hosting on an H100 or H200 GPU brings the lowest per-token cost at real scale but requires upfront GPU capacity; a single H200 handles the full BF16 checkpoint. Example costs: a 100K-token document summary runs about $0.007 on the DeepInfra/OpenRouter rate; a 1M-token-in/200K-token-out coding agent session costs roughly $0.09. Deployment options include NVIDIA's own build.nvidia.com hosted API, Hugging Face weight downloads (BF16, FP8, and 4-bit GGUF variants), OpenRouter, DeepInfra, and self-hosting via vLLM, SGLang, TensorRT-LLM, Llama.cpp, LM Studio, or Unsloth. The model is released under the NVIDIA Open Model License, a commercially permissive custom license (not Apache or MIT) that grants a perpetual, worldwide, royalty-free right to use, modify, and redistribute the model and its derivatives, conditioned on carrying forward the license and NVIDIA's Trustworthy AI usage terms; rights terminate automatically if a user disables built-in safety guardrails. NVIDIA released roughly 11,000 labeled agent-safety traces from tool-using workflows alongside the model to help developers evaluate and red-team agentic behavior, its main disclosed safety artifact for this release; NVIDIA does not publish a formal system card or named third-party red-team partner list in the style of Anthropic or OpenAI. Training used a 25-trillion-token pretraining corpus (including 2.5T newly-added Common Crawl tokens) plus 13 million post-training samples, with a Warmup-Stable-Decay learning-rate schedule; an exact training data cutoff date was not disclosed in NVIDIA's public materials. Nemotron 3 Nano is best suited for teams building high-volume agentic pipelines, multi-agent orchestration, or tool-calling assistants where per-request cost and throughput matter more than topping every reasoning benchmark, and for teams wanting to self-host an open model with a 1M-token context budget. It is a weaker choice for teams needing image or audio understanding (use Nemotron 3 Nano Omni or a multimodal frontier model instead), or for teams that specifically need the highest raw MMLU-Pro score in the 30B-class open-weight tier, where Qwen3-30B-A3B currently edges it out. NVIDIA has signaled a coalition-built Nemotron 4 family as the next major generation, with Nemotron 3 Super and Ultra already extending the same architecture to 120B and 550B total parameters through mid-2026.

Pricing

$0.05 per 1M input tokens and $0.20 per 1M output tokens on DeepInfra and OpenRouter. No official NVIDIA-set price since this is an open-weight model; self-hosting on owned or rented H100/H200 GPUs is the lowest per-token cost at scale.

Key Features

Pros

Cons

Benchmarks

Frequently Asked Questions

What is Nemotron 3 Nano 30B A3B and who built it?

Nemotron 3 Nano 30B-A3B is an open-weight language model built by NVIDIA and released on December 14, 2025 as the entry tier of the Nemotron 3 family, which later added Super (120B, March 2026) and Ultra (550B, June 2026) models. It uses a hybrid architecture that interleaves Mamba-2 state-space layers, grouped-query-attention Transformer layers, and Mixture-of-Experts routing across 128 experts (6 active per token, plus shared experts), across 52 total layers. The model has 31.6 billion total parameters but only around 3.6 billion active per token, roughly 10% of its weights. On NVIDIA's own technical report it scores 78.3% on MMLU-Pro, 89.06% on AIME 2025 without tools (99.17% with tools enabled), and 73.04% on GPQA. It was designed specifically to make high-throughput agentic AI, tool-calling loops with long context, cheap to run at scale rather than to maximize raw parameter count. It competes directly against Qwen3-30B-A3B and GPT-OSS-20B in the open-weight 30B-class tier, and is priced at $0.05 input / $0.20 output per 1M tokens on OpenRouter and DeepInfra.

How much does Nemotron 3 Nano 30B A3B cost per 1M tokens?

Nemotron 3 Nano is an open-weight model, so NVIDIA does not set an official per-token price; pricing depends on which provider hosts it. Both DeepInfra and OpenRouter list it at $0.05 per 1M input tokens and $0.20 per 1M output tokens, among the cheapest rates for a model of this benchmark tier. A 100K-token document summary costs roughly $0.007 at this rate, and a 1M-in/200K-out coding agent session runs about $0.09. There is no published cached-input discount or batch API tier from these providers as of mid-2026. Self-hosting on a single NVIDIA H100 or H200 GPU eliminates per-token fees entirely and is the lowest-cost option at high sustained volume, since only 3.6B of the model's 31.6B parameters activate per token. Compared to Qwen3-30B-A3B, which is priced similarly on most open-model hosts, Nemotron's 3.3x throughput advantage on identical hardware translates into a lower effective cost per request at scale even at the same nominal per-token rate. There is no free managed tier from NVIDIA directly, though OpenRouter offers rate-limited free access.

What is Nemotron 3 Nano 30B A3B's context window and max output?

Nemotron 3 Nano supports a native 1 million token context window, achieved through continued pretraining at a 512K sequence length after initial training. NVIDIA has not published a separate maximum output token limit distinct from the overall context budget. Long-context recall is verified on the RULER benchmark, where the model retains 68.2% accuracy at full context length, ahead of Qwen3-30B-A3B's retention at comparable depth. Many managed API providers, including OpenRouter and DeepInfra, default to exposing only a 262K-token context window rather than the full 1M, so developers who need the extended window should check provider configuration settings or self-host with an inference engine explicitly configured for 1M tokens. There is no separate long-context pricing tier since the model is open-weight; self-hosted deployments bear the memory cost of the larger KV cache directly. Compared to competitors, its 1M-token ceiling matches or exceeds most open 30B-class models, and its RULER retention score is a specific advantage over Qwen3-30B at similar depth.

How does Nemotron 3 Nano 30B A3B compare on benchmarks vs Qwen3-30B-A3B?

Nemotron 3 Nano trails Qwen3-30B-A3B slightly on general knowledge, scoring 78.3% on MMLU-Pro versus Qwen3's 80.9%, a roughly 2.6-point gap. However, Nemotron claims a 3.3x throughput advantage over Qwen3-30B-A3B when both run on identical single-H200 GPU hardware at an 8K input / 16K output workload, which NVIDIA attributes to its hybrid Mamba-Transformer architecture activating fewer parameters more efficiently per token. On long-context retention, Nemotron holds 68.2% accuracy at full context on the RULER benchmark, ahead of Qwen3-30B's retention at similar depth. NVIDIA's own benchmarks are self-reported in its technical report (arXiv 2512.20848) rather than independently verified by a third party, so the MMLU-Pro and throughput comparisons should be read as vendor claims pending independent confirmation. In practice, a 2.6-point MMLU-Pro gap translates to marginally more errors on general knowledge and reasoning tasks, while the throughput gap translates directly into lower GPU-hours and dollar cost for high-volume agentic workloads. Neither model publishes a SWE-bench Verified score directly comparable to frontier proprietary models, so head-to-head coding comparisons rely on LiveCodeBench and SWE-Bench (OpenHands) numbers instead.

Is Nemotron 3 Nano 30B A3B open source or proprietary?

Nemotron 3 Nano is open-weight, released under the NVIDIA Open Model License, a custom commercially-permissive license rather than a standard permissive license like Apache 2.0 or MIT, so it is classified as open-weights rather than fully open-source. The license grants a perpetual, worldwide, royalty-free right to use, modify, and redistribute the model and its derivatives for commercial purposes, provided redistributors carry forward the license text and an attribution notice, and comply with NVIDIA's Trustworthy AI usage terms. Rights terminate automatically if a user disables or circumvents the model's built-in safety guardrails. Weights are available on Hugging Face in BF16 (full precision, requiring roughly 63GB of VRAM on a single H200), FP8 (roughly 32GB), and 4-bit GGUF quantized variants for lower-memory self-hosting. There are no separate proprietary-only variants of this specific Nano model; the larger Super and Ultra tiers in the Nemotron 3 family follow the same open licensing approach. Commercial use, including building and selling products on top of the model, is explicitly permitted under the license terms.

What modalities does Nemotron 3 Nano 30B A3B support?

Nemotron 3 Nano 30B-A3B is a text-in, text-out model: it accepts text prompts and produces text output, including structured tool-call and function-calling output, but it does not accept image, audio, or video input directly. Its native tool-calling support is a first-class capability, boosting AIME 2025 math accuracy from 89.06% without tools to 99.17% with tools enabled, and it supports structured JSON-style output for agentic function calling. For multimodal use cases, NVIDIA released a separate model, Nemotron 3 Nano Omni, specifically built to handle document, audio, and video input in a single model; that variant should be used instead if visual or audio understanding is required. The model also ships with a configurable reasoning mode, toggling extended chain-of-thought token generation on or off with an adjustable thinking-token budget, which is a capability distinct from modality support but often discussed alongside it. There is no computer-use or screen-reading capability documented for this model.

Does Nemotron 3 Nano 30B A3B train on user data?

Because Nemotron 3 Nano is an open-weight model rather than a hosted proprietary service, there is no single default data-retention policy that applies universally; behavior depends entirely on which provider serves the model. Self-hosted deployments give the operator full control over prompts and outputs with no data leaving their own infrastructure. For third-party managed APIs like OpenRouter and DeepInfra, retention and training-on-inputs policies are governed by each provider's own terms rather than NVIDIA's, and were not independently confirmed for this profile. NVIDIA's own build.nvidia.com hosted endpoint did not have a clearly published retention policy in the documentation reviewed. NVIDIA does not disclose SOC 2, ISO 27001, or GDPR compliance specific to Nemotron model inference (its broader corporate compliance certifications, covering ISO 27001, SOC 2, and others, apply to its enterprise cloud and hardware businesses generally, not specifically to this open model's inference path). Teams with strict data-residency or compliance requirements around this model should self-host it within their own certified infrastructure rather than relying on a third-party API's default policy.

Who is Nemotron 3 Nano 30B A3B best for and who should avoid it?

Nemotron 3 Nano is best suited for teams building high-volume agentic pipelines or multi-agent tool-calling systems where inference cost and throughput matter as much as raw accuracy, since its hybrid MoE architecture delivers 3.3x the throughput of Qwen3-30B-A3B on the same hardware. It also suits teams processing very long documents or extended conversation histories on a budget, given its 1M-token context window and strong RULER long-context retention, and teams that specifically need to self-host an open model with commercial redistribution rights under a permissive custom license. Teams needing image, audio, or video understanding should avoid this specific model and use Nemotron 3 Nano Omni instead, since this variant is strictly text-in/text-out. Teams chasing the single highest MMLU-Pro score in the 30B-class open-weight tier should consider Qwen3-30B-A3B instead, which scores roughly 2.6 points higher. Teams building frontier-level autonomous coding agents may find its 38.76% SWE-Bench (OpenHands) score too modest compared to dedicated coding-focused proprietary models like Claude or GPT-5-class systems, and should evaluate those instead for that specific workload.

Visit Nemotron 3 Nano 30B A3B Official Page