Name: Nemotron 3 Nano 30B A3B: 1M Context & 3.3x Speed (2026)
Brand: NVIDIA
Price: 0.05 USD
Availability: InStock

Question 1

What is Nemotron 3 Nano 30B A3B and who built it?

Accepted Answer

Nemotron 3 Nano 30B-A3B is an open-weight language model built by NVIDIA and released on December 14, 2025 as the entry tier of the Nemotron 3 family, which later added Super (120B, March 2026) and Ultra (550B, June 2026) models. Its 52-layer hybrid architecture interleaves Mamba-2 state-space layers, grouped-query-attention Transformer layers, and Mixture-of-Experts routing over 128 experts (6 active per token, plus shared experts). The model has 31.6 billion total parameters but only around 3.6 billion active per token, roughly 11% of its weights. In NVIDIA's own technical report it scores 78.3% on MMLU-Pro, 89.06% on AIME 2025 without tools (99.17% with tools enabled), and 73.04% on GPQA. It was designed to make high-throughput agentic AI (tool-calling loops with long context) cheap to run at scale, rather than to maximize raw parameter count. It competes directly against Qwen3-30B-A3B and GPT-OSS-20B in the open-weight 30B-class tier, and is listed at $0.05 input / $0.20 output per 1M tokens on OpenRouter and DeepInfra.

Question 2

How much does Nemotron 3 Nano 30B A3B cost per 1M tokens?

Accepted Answer

Nemotron 3 Nano is an open-weight model, so NVIDIA does not set an official per-token price; pricing depends on which provider hosts it. Both DeepInfra and OpenRouter list it at $0.05 per 1M input tokens and $0.20 per 1M output tokens, among the cheapest rates for a model of this benchmark tier. A 100K-token document summary costs roughly $0.007 at this rate (about $0.005 for the input, with the remainder covering a summary of around 10K output tokens), and a 1M-in/200K-out coding agent session runs about $0.09. There is no published cached-input discount or batch API tier from these providers as of mid-2026. Self-hosting on a single NVIDIA H100 or H200 GPU eliminates per-token fees entirely and is the lowest-cost option at high sustained volume, since only 3.6B of the model's 31.6B parameters activate per token. Qwen3-30B-A3B is priced similarly on most open-model hosts, so the nominal per-token rate is about the same; the difference shows up where the operator pays for GPU time, because NVIDIA's reported 3.3x throughput advantage on identical hardware means fewer GPU-hours per request at scale. There is no free managed tier from NVIDIA directly, though OpenRouter offers rate-limited free access.
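
The per-request figures above follow from simple arithmetic on the two listed rates. The sketch below reproduces them; the function name request_cost is illustrative, and the 10K-token summary length is an assumption chosen to match the roughly $0.007 figure, not a number published by either provider.

```python
# Per-1M-token rates quoted in the answer for DeepInfra and OpenRouter (USD).
INPUT_USD_PER_M = 0.05
OUTPUT_USD_PER_M = 0.20


def request_cost(input_tokens: int, output_tokens: int,
                 input_rate: float = INPUT_USD_PER_M,
                 output_rate: float = OUTPUT_USD_PER_M) -> float:
    """Return the USD cost of one request at flat per-1M-token rates."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


if __name__ == "__main__":
    # Coding-agent session from the answer: 1M tokens in, 200K tokens out.
    print(f"agent session: ${request_cost(1_000_000, 200_000):.2f}")
    # 100K-token document; the 10K-token summary length is an assumption.
    print(f"doc summary:   ${request_cost(100_000, 10_000):.3f}")
```

Because there is no cached-input or batch discount listed, a flat two-rate formula is enough here; a provider that adds such tiers would need extra terms.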

Question 3

What is Nemotron 3 Nano 30B A3B's context window and max output?

Accepted Answer

Nemotron 3 Nano supports a native 1-million-token context window, achieved through continued pretraining at a 512K sequence length after initial training. NVIDIA has not published a maximum output token limit separate from the overall context budget. Long-context recall is measured on the RULER benchmark, where the model retains 68.2% accuracy at full context length, ahead of Qwen3-30B-A3B's retention at comparable depth. Many managed API providers, including OpenRouter and DeepInfra, default to exposing only a 262K-token context window rather than the full 1M, so developers who need the extended window should check provider configuration settings or self-host with an inference engine explicitly configured for 1M tokens. There is no separate long-context pricing tier since the model is open-weight; self-hosted deployments bear the memory cost of the larger KV cache directly. Its 1M-token ceiling matches or exceeds that of most open 30B-class models.
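
Since output shares the same budget as the prompt, a request fits only if prompt tokens plus reserved output tokens stay within whatever window the serving endpoint exposes. A minimal sketch of that check follows; the constant names and the helper fits_in_window are hypothetical, and 262_000 is the rounded "262K" figure from the text, so a provider's exact limit may differ.

```python
FULL_CONTEXT_TOKENS = 1_000_000    # native window stated in the answer
PROVIDER_DEFAULT_TOKENS = 262_000  # rounded "262K" default; exact value is an assumption


def fits_in_window(prompt_tokens: int, max_output_tokens: int,
                   window_tokens: int) -> bool:
    """True if the prompt plus the reserved output fits in the given window."""
    if prompt_tokens < 0 or max_output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return prompt_tokens + max_output_tokens <= window_tokens


if __name__ == "__main__":
    prompt, output = 300_000, 16_000
    # The same request is checked against the provider default and the full window.
    print("provider default:", fits_in_window(prompt, output, PROVIDER_DEFAULT_TOKENS))
    print("full 1M window:  ", fits_in_window(prompt, output, FULL_CONTEXT_TOKENS))
```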

Question 4

How does Nemotron 3 Nano 30B A3B compare on benchmarks vs Qwen3-30B-A3B?

Accepted Answer

Nemotron 3 Nano trails Qwen3-30B-A3B slightly on general knowledge, scoring 78.3% on MMLU-Pro versus Qwen3's 80.9%, a 2.6-point gap. However, NVIDIA claims a 3.3x throughput advantage over Qwen3-30B-A3B when both run on a single H200 GPU at an 8K input / 16K output workload, which it attributes to the hybrid Mamba-Transformer architecture activating fewer parameters per token. On long-context retention, Nemotron holds 68.2% accuracy at full context on the RULER benchmark, ahead of Qwen3-30B-A3B's retention at similar depth. These figures are self-reported in NVIDIA's technical report (arXiv 2512.20848) and have not been verified by a third party, so the MMLU-Pro and throughput comparisons should be read as vendor claims pending independent confirmation. In practice, a 2.6-point MMLU-Pro gap translates to marginally more errors on general knowledge and reasoning tasks, while the throughput gap translates directly into lower GPU-hours and dollar cost for high-volume agentic workloads. Neither model publishes a SWE-bench Verified score directly comparable to frontier proprietary models, so head-to-head coding comparisons rely on LiveCodeBench and SWE-Bench (OpenHands) numbers instead.
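
To see how a throughput multiple turns into GPU cost, note that the cost of generating a fixed number of tokens is inversely proportional to tokens per second. The sketch below illustrates only that relationship: the GPU hourly price and the baseline throughput are hypothetical placeholders, not measured or published values, and the 3.3x factor is the vendor-reported figure from the answer.

```python
def gpu_cost_per_million_tokens(gpu_usd_per_hour: float,
                                tokens_per_second: float) -> float:
    """USD of GPU time needed to generate one million tokens at a steady rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000


# Hypothetical inputs, chosen only to illustrate the ratio.
GPU_USD_PER_HOUR = 3.00           # assumed rental price, not from the source
BASELINE_TOKENS_PER_SEC = 1000.0  # assumed baseline throughput, not from the source
CLAIMED_SPEEDUP = 3.3             # NVIDIA's reported figure from the answer

baseline_cost = gpu_cost_per_million_tokens(GPU_USD_PER_HOUR, BASELINE_TOKENS_PER_SEC)
faster_cost = gpu_cost_per_million_tokens(
    GPU_USD_PER_HOUR, BASELINE_TOKENS_PER_SEC * CLAIMED_SPEEDUP
)

print(f"baseline: ${baseline_cost:.3f} per 1M tokens")
print(f"3.3x:     ${faster_cost:.3f} per 1M tokens")
```

Whatever the real hourly price and baseline turn out to be, the ratio between the two costs equals the speedup, which is why the claim matters most for operators who pay for GPU time rather than per token.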

Question 5

Is Nemotron 3 Nano 30B A3B open source or proprietary?

Accepted Answer

Nemotron 3 Nano is open-weight, released under the NVIDIA Open Model License, a custom commercially permissive license rather than a standard permissive license like Apache 2.0 or MIT, so it is classified as open-weights rather than fully open-source. The license grants a perpetual, worldwide, royalty-free right to use, modify, and redistribute the model and its derivatives for commercial purposes, provided redistributors carry forward the license text and an attribution notice, and comply with NVIDIA's Trustworthy AI usage terms. Rights terminate automatically if a user disables or circumvents the model's built-in safety guardrails. Weights are available on Hugging Face in BF16 (unquantized, requiring roughly 63GB of VRAM, which fits on a single H200), FP8 (roughly 32GB), and 4-bit GGUF quantized variants for lower-memory self-hosting. There are no separate proprietary-only variants of this specific Nano model; the larger Super and Ultra tiers in the Nemotron 3 family follow the same open licensing approach. Commercial use, including building and selling products on top of the model, is explicitly permitted under the license terms.
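
The VRAM figures above are consistent with multiplying the 31.6B total parameter count by the bytes stored per parameter. The sketch below shows that estimate for the weights only. It ignores KV cache, activations, and runtime overhead, and the 4-bit value is a lower bound, since quantized formats such as GGUF also store scaling metadata. The names used are illustrative.

```python
TOTAL_PARAMS = 31.6e9  # total parameter count stated in the first answer

# Bytes stored per parameter for each weight format (weights only).
BYTES_PER_PARAM = {
    "bf16": 2.0,
    "fp8": 1.0,
    "int4": 0.5,  # lower bound; real 4-bit files carry extra scale metadata
}


def weight_memory_gb(total_params: float, precision: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return total_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        print(f"{precision}: {weight_memory_gb(TOTAL_PARAMS, precision):.1f} GB")
```

All parameters must be resident in memory even though only about 3.6B are active per token, which is why the footprint tracks the total count rather than the active count.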

Question 6

What modalities does Nemotron 3 Nano 30B A3B support?

Accepted Answer

Nemotron 3 Nano 30B-A3B is a text-in, text-out model: it accepts text prompts and produces text output, including structured tool-call and function-calling output, but it does not accept image, audio, or video input directly. Native tool calling is a first-class capability, lifting AIME 2025 math accuracy from 89.06% without tools to 99.17% with tools enabled, and the model supports structured JSON-style output for agentic function calling. For multimodal use cases, NVIDIA released a separate model, Nemotron 3 Nano Omni, built to handle document, audio, and video input in a single model; that variant should be used instead if visual or audio understanding is required. The model also ships with a configurable reasoning mode that toggles extended chain-of-thought generation on or off with an adjustable thinking-token budget. This is not a modality, but it is often discussed alongside modality support. There is no computer-use or screen-reading capability documented for this model.

Question 7

Does Nemotron 3 Nano 30B A3B train on user data?

Accepted Answer

Because Nemotron 3 Nano is an open-weight model rather than a hosted proprietary service, there is no single default data-retention policy that applies universally; behavior depends entirely on which provider serves the model. Self-hosted deployments give the operator full control over prompts and outputs with no data leaving their own infrastructure. For third-party managed APIs like OpenRouter and DeepInfra, retention and training-on-inputs policies are governed by each provider's own terms rather than NVIDIA's, and were not independently confirmed for this profile. NVIDIA's own build.nvidia.com hosted endpoint did not have a clearly published retention policy in the documentation reviewed. NVIDIA does not disclose SOC 2, ISO 27001, or GDPR compliance specific to Nemotron model inference. Its broader corporate compliance certifications, covering ISO 27001, SOC 2, and others, apply to its enterprise cloud and hardware businesses generally, not specifically to this open model's inference path. Teams with strict data-residency or compliance requirements should self-host the model within their own certified infrastructure rather than rely on a third-party API's default policy.

Question 8

Who is Nemotron 3 Nano 30B A3B best for and who should avoid it?

Accepted Answer

Nemotron 3 Nano is best suited for teams building high-volume agentic pipelines or multi-agent tool-calling systems where inference cost and throughput matter as much as raw accuracy, since NVIDIA reports that its hybrid MoE architecture delivers 3.3x the throughput of Qwen3-30B-A3B on the same hardware. It also suits teams processing very long documents or extended conversation histories on a budget, given its 1M-token context window and 68.2% RULER retention at full context, and teams that need to self-host an open model with commercial redistribution rights under a permissive custom license. Teams needing image, audio, or video understanding should avoid this specific model and use Nemotron 3 Nano Omni instead, since this variant is strictly text-in/text-out. Teams chasing the highest MMLU-Pro score between the two should consider Qwen3-30B-A3B, which scores 2.6 points higher. Teams building frontier-level autonomous coding agents may find its 38.76% SWE-Bench (OpenHands) score too modest compared to proprietary systems such as Claude or GPT-5-class models, and should evaluate those for that workload.
