

A hybrid Mixture-of-Experts reasoning model that punches far above its active parameter count — running on just 12 billion active weights while drawing on the depth of 120 billion total parameters. Built for the realities of production-grade agentic systems.
NVIDIA's Nemotron 3 Super 120B-A12B is part of the third generation of the Nemotron open model family — a series engineered specifically for building specialized, reliable AI agents rather than serving as a general-purpose chatbot. The "Super" designation marks a meaningful architectural step up from the lighter Nano variant, introducing several capabilities that simply weren't present before.
NVIDIA achieves this efficiency through its LatentMoE approach, in which expert routing happens in a compressed latent space rather than at full model dimensionality, so the system is smarter about which experts to engage, not just how many.
Tokens are first projected into a compressed latent space for expert routing and computation. Because the experts operate at this reduced width, activating four of them costs roughly the compute of a single expert at full model dimensionality. That is what makes the 12B active parameter count feel much larger than it is.
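To make the mechanism concrete, here is a minimal PyTorch sketch of latent-space expert routing in general: tokens are projected down, routed and processed by a few small experts at the reduced width, then projected back up. The dimensions, expert count, and top-k value are illustrative placeholders, not Nemotron's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentMoESketch(nn.Module):
    """Illustrative latent-space MoE block: routing and expert computation
    happen at a compressed latent width, then results are projected back
    to full model width. All sizes here are placeholders."""

    def __init__(self, d_model=4096, d_latent=1024, n_experts=64, top_k=4):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent)        # compress to latent space
        self.router = nn.Linear(d_latent, n_experts)    # route in latent space
        self.experts = nn.ModuleList(
            [nn.Linear(d_latent, d_latent) for _ in range(n_experts)]
        )
        self.up = nn.Linear(d_latent, d_model)          # project back up
        self.top_k = top_k

    def forward(self, x):                               # x: (tokens, d_model)
        z = self.down(x)                                # (tokens, d_latent)
        gate = F.softmax(self.router(z), dim=-1)
        top_w, top_i = gate.topk(self.top_k, dim=-1)
        top_w = top_w / top_w.sum(dim=-1, keepdim=True) # renormalize chosen experts
        out = torch.zeros_like(z)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                hit = top_i[:, slot] == e               # tokens routed to expert e
                if hit.any():
                    out[hit] += top_w[hit, slot].unsqueeze(-1) * expert(z[hit])
        return self.up(out)                             # (tokens, d_model)
```

The point to notice is that the per-token expert compute scales with the latent width rather than the full model width, which is what lets four active experts cost roughly as much as one full-width expert.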
Rather than relying purely on attention, the model interleaves Mamba-2 state-space layers with selective attention blocks. Mamba-2 handles long-range context efficiently; attention handles local precision. The combination is faster than a dense transformer at long contexts.
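A back-of-the-envelope sketch of that claim, under assumed layer counts and an assumed attention ratio (neither is taken from the model card): per decoded token, each attention layer touches the entire KV history, while a state-space layer only updates a fixed-size state.

```python
def context_mixing_cost(seq_len, n_layers=48, attn_every=6, d_model=4096, d_state=128):
    """Cost of mixing in the context for one decoded token, up to constants:
    attention layers scan the whole KV history, SSM layers update a fixed state."""
    attn_layers = n_layers // attn_every
    ssm_layers = n_layers - attn_layers
    return attn_layers * seq_len * d_model + ssm_layers * d_model * d_state

for n in (8_192, 131_072, 1_000_000):
    hybrid = context_mixing_cost(n)               # interleaved pattern
    dense = context_mixing_cost(n, attn_every=1)  # attention in every layer
    print(f"{n:>9} tokens: hybrid ~{dense / hybrid:.1f}x cheaper per decoded token")
```

As context grows, the advantage approaches the fraction of layers that remain attention, and this toy model ignores the KV-cache memory savings, which are often the bigger win in practice.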
Unlike models that predict one token at a time, Nemotron 3 Super uses multi-token prediction (MTP) layers as a native speculative decoding mechanism. This is what drives the 167+ tokens/sec output speed: it isn't just hardware, it's baked into the model weights.
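The loop below is a toy illustration of the draft-and-verify pattern that MTP enables, with stand-in functions instead of real model calls; it is meant to show why more than one token can be emitted per full forward pass, not how Nemotron's heads are actually wired.

```python
import random

def draft_next_k(ids, k):
    """Cheap draft step: MTP heads propose k future tokens in a single pass.
    (Toy stand-in: random token ids.)"""
    return [random.randint(0, 9) for _ in range(k)]

def verify_and_extend(ids, draft):
    """One full forward pass scores every drafted position in parallel.
    The agreeing prefix is kept; the first disagreement is replaced by the
    verifier's own token, so each round still emits at least one token."""
    accepted = []
    for tok in draft:
        if random.random() < 0.7:            # toy stand-in for "draft agrees with model"
            accepted.append(tok)
        else:
            accepted.append(random.randint(0, 9))
            break
    return accepted

def speculative_decode(prompt_ids, k=4, max_new=32):
    out = list(prompt_ids)
    while len(out) - len(prompt_ids) < max_new:
        out.extend(verify_and_extend(out, draft_next_k(out, k)))
    return out[: len(prompt_ids) + max_new]

print(speculative_decode([1, 2, 3]))
```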
The Super model is the first in its family pretrained at NVFP4 precision rather than using it only for post-training quantization. This allowed training on the full 25T+ token corpus more efficiently without sacrificing the quality typically expected of BF16-trained models.
The million-token context window is not just a theoretical maximum: the model outperforms both GPT-OSS-120B and Qwen3.5-122B on the RULER benchmark at the full 1M-token setting. This matters for agent workflows where conversation and tool-use history must stay in context across hundreds of steps.
Reasoning behavior is toggled via a flag in the chat template. When enabled, the model generates an internal reasoning trace before its final response — useful for complex multi-step tasks. When disabled, it responds directly, reducing latency for simpler queries.
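A sketch of what that looks like with the Hugging Face tokenizer, assuming the chat template exposes a boolean switch; the flag name enable_thinking and the model id are assumptions here, so check the model card for the template's actual control.

```python
from transformers import AutoTokenizer

# The model id and the enable_thinking flag are assumptions for illustration;
# the model card defines the template's actual switch.
tok = AutoTokenizer.from_pretrained("nvidia/Nemotron-3-Super-120B-A12B")

messages = [{"role": "user", "content": "Plan a three-step fix for the failing CI job."}]

# Reasoning on: the template asks the model to emit an internal trace first.
prompt_thinking = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

# Reasoning off: the model answers directly, trading depth for latency.
prompt_direct = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
```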
Benchmark scores are only meaningful when you know what they're measuring. Below are the evaluations most relevant to real-world developer use cases — not cherry-picked academic tasks, but coding agents, long-context retrieval, math reasoning, and instruction-following under load.
This isn't a model you deploy for simple question-and-answer flows. It's built for sustained, multi-step, tool-using workflows where context accumulates and reasoning chains span dozens of decisions. Here's where it genuinely earns its keep.
Designed from the ground up for collaborative agent pipelines. The million-token context lets it track complete state across planner, researcher, and executor sub-agents without truncation.
SWE-Bench Verified and PinchBench scores reflect performance on actual repositories: filing fixes, navigating codebases, and executing commands in a terminal, not just generating code.
High-volume workloads like IT ticket triage are a primary design target. The MoE architecture keeps per-call compute costs low even when request volume is sustained.
Cross-document aggregation and multi-document reasoning were explicitly part of the fine-tuning data. Useful for legal review, technical due diligence, and research summarization.
Explicitly fine-tuned on structured output tasks. JSON schema adherence, tool-call formatting, and instruction-following with complex constraints are reliable in production settings.
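For example, if the model is served behind an OpenAI-compatible endpoint (such as vLLM or an NVIDIA NIM), schema-constrained output can be requested directly; the base URL, model id, and schema below are placeholders, not values from the model card.

```python
from openai import OpenAI

# Placeholder endpoint and model id for a local OpenAI-compatible deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

ticket_schema = {
    "type": "object",
    "properties": {
        "category": {"type": "string", "enum": ["network", "hardware", "access", "other"]},
        "priority": {"type": "integer", "minimum": 1, "maximum": 4},
        "summary": {"type": "string"},
    },
    "required": ["category", "priority", "summary"],
}

resp = client.chat.completions.create(
    model="nvidia/nemotron-3-super-120b-a12b",   # placeholder model id
    messages=[{"role": "user", "content": "Triage: VPN drops every 10 minutes for the finance team."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "ticket", "schema": ticket_schema},
    },
)
print(resp.choices[0].message.content)
```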
Pre-training included substantial synthetic math and science data. Reinforcement learning across 10+ environments — including formal reasoning — drove the AIME 2025 benchmark results.