

A sparse mixture-of-experts language model that activates just 3B parameters per token while drawing on a 30B-parameter pool of learned knowledge, built from the ground up for agentic AI systems, production RAG pipelines, and long-context reasoning at scale.
30B total parameters, with the A3B suffix indicating that roughly 3 billion of them are activated per inference pass. NVIDIA built this model from scratch rather than fine-tuning someone else's base, training it on 25 trillion tokens of text covering code, math, science, general knowledge, and multilingual data.
What makes it genuinely different from most open-weight models in this size class is the architecture. Rather than stacking standard transformer attention layers throughout, NVIDIA combined three distinct layer types — Mamba-2, Mixture-of-Experts, and grouped-query attention — into a hybrid stack that runs significantly faster in practice while handling very long contexts without the usual memory blowup.
The architectural backbone of Nemotron 3 Nano is what separates it from straightforward scaled-down transformers. NVIDIA calls it a Hybrid Mamba-Transformer MoE, and understanding each component helps clarify exactly why the model performs the way it does.
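To make the memory argument concrete, here's a back-of-the-envelope sketch comparing KV-cache growth in a pure-attention stack against a hybrid where only a handful of layers keep attention. The layer counts, head dimension, and GQA settings below are illustrative placeholders, not the published Nemotron 3 Nano configuration.

```python
# Back-of-the-envelope KV-cache sizing. Every architecture number here is a
# hypothetical placeholder used only to show the scaling behaviour; it is NOT
# the published Nemotron 3 Nano configuration.

BYTES_PER_ELEM = 2      # bf16
HEAD_DIM = 128
KV_HEADS = 8            # grouped-query attention keeps only a few KV heads
SEQ_LEN = 262_144       # full context window

def kv_cache_bytes(attention_layers: int, seq_len: int) -> int:
    """KV cache = 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes."""
    return 2 * attention_layers * KV_HEADS * HEAD_DIM * seq_len * BYTES_PER_ELEM

pure_attention = kv_cache_bytes(attention_layers=48, seq_len=SEQ_LEN)
hybrid = kv_cache_bytes(attention_layers=6, seq_len=SEQ_LEN)  # most layers are Mamba-2,
                                                              # whose state doesn't grow with seq_len

print(f"pure attention KV cache: {pure_attention / 1e9:.1f} GB")  # ~51.5 GB
print(f"hybrid, 6 attention layers: {hybrid / 1e9:.1f} GB")       # ~6.4 GB
```

The Mamba-2 layers carry a fixed-size recurrent state regardless of sequence length, so only the small number of remaining attention layers pay the per-token cache cost.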
Numbers matter more than marketing, so here's where Nemotron 3 Nano 30B A3B actually lands against its direct open-weight competitors — Qwen3-30B-A3B and GPT-OSS-20B.
On long-context evaluations (RULER benchmark), Nemotron 3 Nano outperforms both Qwen3-30B-A3B-Instruct-2507 and GPT-OSS-20B across varying context lengths — a direct payoff from the Mamba-2 linear-time design. FP8 quantization retains approximately 99% of BF16 accuracy, meaning the model runs efficiently on 24GB VRAM cards like the RTX 4090 without a significant quality hit.
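For reference, a minimal vLLM offline-inference sketch looks something like the following; the Hugging Face repo id and the FP8 option are assumptions to verify against the model card, since NVIDIA may publish a pre-quantized FP8 checkpoint that needs no extra flag.

```python
# Minimal vLLM offline-inference sketch. The repo id below is a placeholder;
# check the actual Hugging Face model card for the exact name, and drop
# quantization= if you load an already-FP8-quantized checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/Nemotron-3-Nano-30B-A3B",  # placeholder repo id
    quantization="fp8",                       # on-the-fly FP8; assumes supported hardware
    max_model_len=262_144,                    # full window; lower it to save memory on 24GB cards
)

params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)
outputs = llm.generate(["Summarize the trade-offs of hybrid Mamba-Transformer MoE models."], params)
print(outputs[0].outputs[0].text)
```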
Nemotron 3 Nano was fine-tuned and RL-trained with production use cases in mind, not benchmark chasing. It handles both reasoning and non-reasoning modes via a flag in the chat template — useful for cutting latency on simpler tasks while retaining full chain-of-thought quality where it matters.
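The exact switch lives in the chat template, so check the model card for the real mechanism; the sketch below assumes an `enable_thinking` keyword on `apply_chat_template`, which is how several recent open models expose the toggle, and treats it purely as a stand-in.

```python
# Sketch of toggling reasoning vs. non-reasoning mode via the chat template.
# The `enable_thinking` kwarg is an ASSUMPTION modeled on other recent open
# models; consult the Nemotron 3 Nano chat template for the actual flag
# (some models use a system-prompt marker such as "/no_think" instead).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nvidia/Nemotron-3-Nano-30B-A3B")  # placeholder repo id

messages = [{"role": "user", "content": "What is 17 * 23?"}]

# Low-latency path: skip the chain-of-thought for a simple request.
fast_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)

# Full reasoning path for harder problems.
slow_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
```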
Explicitly designed for multi-step agent loops: tool calling, planning, structured output generation, and code execution. The AIME-with-tools score of 99.2% illustrates how well it integrates with external functions.
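A single agent turn against an OpenAI-compatible endpoint (how vLLM and SGLang serve the model) might look like the skeleton below; the endpoint URL, served model name, and tool schema are illustrative assumptions rather than values from NVIDIA's documentation.

```python
# Skeleton of one tool-calling turn against a self-hosted, OpenAI-compatible
# server (e.g. started with `vllm serve ...`). Endpoint, model name, and the
# tool itself are illustrative placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "run_python",
        "description": "Execute a short Python snippet and return stdout.",
        "parameters": {
            "type": "object",
            "properties": {"code": {"type": "string"}},
            "required": ["code"],
        },
    },
}]

messages = [{"role": "user", "content": "What is the 20th Fibonacci number? Use the tool."}]
resp = client.chat.completions.create(model="nemotron-3-nano", messages=messages, tools=tools)

call = resp.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)  # e.g. {"code": "..."}
tool_output = "6765"  # a real agent loop would sandbox-execute args["code"] here

# Feed the tool result back so the model can produce its final answer.
messages += [resp.choices[0].message, {"role": "tool", "tool_call_id": call.id, "content": tool_output}]
final = client.chat.completions.create(model="nemotron-3-nano", messages=messages, tools=tools)
print(final.choices[0].message.content)
```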
A 262K-token context window means large codebases, legal documents, research papers, or extended session histories fit in a single call. Long-range coherence is maintained without the memory overhead of pure-attention models.
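Before stuffing an entire repository into one call, it's worth counting tokens against the 262,144 limit. Here's a rough sketch using the model's tokenizer, with a placeholder repo id and deliberately simplistic file gathering.

```python
# Check that a concatenated document set fits in the 262,144-token window
# before sending it in a single call. The repo id is a placeholder.
from pathlib import Path
from transformers import AutoTokenizer

CONTEXT_LIMIT = 262_144
tokenizer = AutoTokenizer.from_pretrained("nvidia/Nemotron-3-Nano-30B-A3B")  # placeholder

corpus = "\n\n".join(
    f"# {p}\n{p.read_text(errors='ignore')}" for p in sorted(Path("my_project").rglob("*.py"))
)
n_tokens = len(tokenizer.encode(corpus))
print(f"{n_tokens:,} tokens ({n_tokens / CONTEXT_LIMIT:.0%} of the window)")
assert n_tokens < CONTEXT_LIMIT - 4_096, "leave headroom for instructions and the response"
```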
Fine-tuned on high-quality code data spanning multiple programming languages. Strong performance on HumanEval and MBPP benchmarks, with a particular edge in software engineering tasks involving agentic tool use.
Trained on math-specific synthetic data and RL-reinforced on reasoning tasks. Competitive with much larger models on AIME 2025, MATH-500, and similar benchmarks, especially when Python execution is available.
Open weights, training recipes, SFT datasets, and RL datasets are all released. Deploy on your own infrastructure using vLLM, SGLang, or TensorRT-LLM, with full control over privacy and customization.
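Because vLLM and SGLang both expose an OpenAI-compatible API, self-hosting mostly comes down to pointing a standard client at your own endpoint; the URL and served model name below are placeholders.

```python
# Query your own server instead of a hosted API.
# vLLM: `vllm serve <model>`   SGLang: `python -m sglang.launch_server --model-path <model>`
# URL and served model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://your-gpu-box:8000/v1", api_key="unused")
reply = client.chat.completions.create(
    model="nemotron-3-nano",
    messages=[{"role": "user", "content": "Outline a migration plan for this service."}],
)
print(reply.choices[0].message.content)
```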
The MoE-at-30B-with-3B-active category has a few prominent models right now. Here's a grounded comparison:
The key differentiator for Nemotron 3 Nano is the Mamba-2 integration. While Qwen3-30B-A3B matches it on active parameters and comes close on many benchmarks, the hybrid architecture gives Nemotron a decisive throughput edge on long sequences. For workloads where you're regularly processing 50K+ token inputs — full codebases, lengthy document sets, extended agent histories — the 3.3× throughput advantage is a real operational consideration, not a footnote.
The trade-off is that GPT-OSS-20B edges Nemotron 3 Nano on general knowledge (MMLU) benchmarks. For broad conversational QA, the dense-parameter model has a slight edge. For reasoning with tools, long-context tasks, and agentic workflows, Nemotron 3 Nano's efficiency-per-token math is hard to argue with at this price point.