

A hybrid Mixture-of-Experts reasoning model that punches far above its active parameter count — running on just 12 billion active weights while drawing on the depth of 120 billion total parameters. Built for the realities of production-grade agentic systems.
NVIDIA's Nemotron 3 Super 120B-A12B is part of the third generation of the Nemotron open model family — a series engineered specifically for building specialized, reliable AI agents rather than serving as a general-purpose chatbot. The "Super" designation marks a meaningful architectural step up from the lighter Nano variant, introducing several capabilities that simply weren't present before.
NVIDIA achieves this efficiency through its LatentMoE approach, in which expert routing happens in a compressed latent space rather than at full model dimensionality, so the system is smarter about which experts to engage, not just how many.
Tokens are first projected into a compressed latent space for expert routing and computation. Because the experts operate at this reduced width, activating four of them costs roughly the compute of a single expert at full model dimensionality. That is what makes the 12B active parameter count feel much larger than it is.
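To make the mechanism concrete, here is a minimal PyTorch sketch of latent-space expert routing in general: tokens are projected down, routed and processed by a few small experts at the reduced width, then projected back up. The dimensions, expert count, and top-k value are illustrative placeholders, not Nemotron's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentMoESketch(nn.Module):
    """Illustrative latent-space MoE block: routing and expert computation
    happen at a compressed latent width, then results are projected back
    to full model width. All sizes here are placeholders."""

    def __init__(self, d_model=4096, d_latent=1024, n_experts=64, top_k=4):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent)        # compress to latent space
        self.router = nn.Linear(d_latent, n_experts)    # route in latent space
        self.experts = nn.ModuleList(
            [nn.Linear(d_latent, d_latent) for _ in range(n_experts)]
        )
        self.up = nn.Linear(d_latent, d_model)          # project back up
        self.top_k = top_k

    def forward(self, x):                               # x: (tokens, d_model)
        z = self.down(x)                                # (tokens, d_latent)
        gate = F.softmax(self.router(z), dim=-1)
        top_w, top_i = gate.topk(self.top_k, dim=-1)
        top_w = top_w / top_w.sum(dim=-1, keepdim=True) # renormalize chosen experts
        out = torch.zeros_like(z)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                hit = top_i[:, slot] == e               # tokens routed to expert e
                if hit.any():
                    out[hit] += top_w[hit, slot].unsqueeze(-1) * expert(z[hit])
        return self.up(out)                             # (tokens, d_model)
```

The point to notice is that the per-token expert compute scales with the latent width rather than the full model width, which is what lets four active experts cost roughly as much as one full-width expert.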
Rather than relying purely on attention, the model interleaves Mamba-2 state-space layers with selective attention blocks. Mamba-2 handles long-range context efficiently; attention handles local precision. The combination is faster than a dense transformer at long contexts.
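A back-of-the-envelope sketch of that claim, under assumed layer counts and an assumed attention ratio (neither is taken from the model card): per decoded token, each attention layer touches the entire KV history, while a state-space layer only updates a fixed-size state.

```python
def context_mixing_cost(seq_len, n_layers=48, attn_every=6, d_model=4096, d_state=128):
    """Cost of mixing in the context for one decoded token, up to constants:
    attention layers scan the whole KV history, SSM layers update a fixed state."""
    attn_layers = n_layers // attn_every
    ssm_layers = n_layers - attn_layers
    return attn_layers * seq_len * d_model + ssm_layers * d_model * d_state

for n in (8_192, 131_072, 1_000_000):
    hybrid = context_mixing_cost(n)               # interleaved pattern
    dense = context_mixing_cost(n, attn_every=1)  # attention in every layer
    print(f"{n:>9} tokens: hybrid ~{dense / hybrid:.1f}x cheaper per decoded token")
```

As context grows, the advantage approaches the fraction of layers that remain attention, and this toy model ignores the KV-cache memory savings, which are often the bigger win in practice.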
Unlike models that predict one token at a time, Nemotron 3 Super uses multi-token prediction (MTP) layers as a native speculative decoding mechanism. This is what drives the 167+ tokens/sec output speed: it isn't just hardware, it's baked into the model weights.
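The loop below is a toy illustration of the draft-and-verify pattern that MTP enables, with stand-in functions instead of real model calls; it is meant to show why more than one token can be emitted per full forward pass, not how Nemotron's heads are actually wired.

```python
import random

def draft_next_k(ids, k):
    """Cheap draft step: MTP heads propose k future tokens in a single pass.
    (Toy stand-in: random token ids.)"""
    return [random.randint(0, 9) for _ in range(k)]

def verify_and_extend(ids, draft):
    """One full forward pass scores every drafted position in parallel.
    The agreeing prefix is kept; the first disagreement is replaced by the
    verifier's own token, so each round still emits at least one token."""
    accepted = []
    for tok in draft:
        if random.random() < 0.7:            # toy stand-in for "draft agrees with model"
            accepted.append(tok)
        else:
            accepted.append(random.randint(0, 9))
            break
    return accepted

def speculative_decode(prompt_ids, k=4, max_new=32):
    out = list(prompt_ids)
    while len(out) - len(prompt_ids) < max_new:
        out.extend(verify_and_extend(out, draft_next_k(out, k)))
    return out[: len(prompt_ids) + max_new]

print(speculative_decode([1, 2, 3]))
```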
The Super model is the first in its family pretrained at NVFP4 precision rather than using it only for post-training quantization. This allowed training on the full 25T+ token corpus more efficiently without sacrificing the quality typically expected of BF16-trained models.
The million-token context window is not just a theoretical maximum: the model outperforms both GPT-OSS-120B and Qwen3.5-122B on the RULER benchmark at the full 1M-token setting. This matters for agent workflows where conversation and tool-use history must stay in context across hundreds of steps.
Reasoning behavior is toggled via a flag in the chat template. When enabled, the model generates an internal reasoning trace before its final response — useful for complex multi-step tasks. When disabled, it responds directly, reducing latency for simpler queries.
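A sketch of what that looks like with the Hugging Face tokenizer, assuming the chat template exposes a boolean switch; the flag name enable_thinking and the model id are assumptions here, so check the model card for the template's actual control.

```python
from transformers import AutoTokenizer

# The model id and the enable_thinking flag are assumptions for illustration;
# the model card defines the template's actual switch.
tok = AutoTokenizer.from_pretrained("nvidia/Nemotron-3-Super-120B-A12B")

messages = [{"role": "user", "content": "Plan a three-step fix for the failing CI job."}]

# Reasoning on: the template asks the model to emit an internal trace first.
prompt_thinking = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

# Reasoning off: the model answers directly, trading depth for latency.
prompt_direct = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
```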
Benchmark scores are only meaningful when you know what they're measuring. Below are the evaluations most relevant to real-world developer use cases — not cherry-picked academic tasks, but coding agents, long-context retrieval, math reasoning, and instruction-following under load.
This isn't a model you deploy for simple question-and-answer flows. It's built for sustained, multi-step, tool-using workflows where context accumulates and reasoning chains span dozens of decisions. Here's where it genuinely earns its keep.
Designed from the ground up for collaborative agent pipelines. The million-token context lets it track complete state across planner, researcher, and executor sub-agents without truncation.
SWE-Bench Verified and PinchBench scores reflect performance on actual repositories: filing fixes, navigating codebases, and executing commands in a terminal, not just generating code.
High-volume workloads like IT ticket triage are a primary design target. The MoE architecture keeps per-call compute costs low even when request volume is sustained.
Cross-document aggregation and multi-document reasoning were explicitly part of the fine-tuning data. Useful for legal review, technical due diligence, and research summarization.
Explicitly fine-tuned on structured output tasks. JSON schema adherence, tool-call formatting, and instruction-following with complex constraints are reliable in production settings.
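For example, if the model is served behind an OpenAI-compatible endpoint (such as vLLM or an NVIDIA NIM), schema-constrained output can be requested directly; the base URL, model id, and schema below are placeholders, not values from the model card.

```python
from openai import OpenAI

# Placeholder endpoint and model id for a local OpenAI-compatible deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

ticket_schema = {
    "type": "object",
    "properties": {
        "category": {"type": "string", "enum": ["network", "hardware", "access", "other"]},
        "priority": {"type": "integer", "minimum": 1, "maximum": 4},
        "summary": {"type": "string"},
    },
    "required": ["category", "priority", "summary"],
}

resp = client.chat.completions.create(
    model="nvidia/nemotron-3-super-120b-a12b",   # placeholder model id
    messages=[{"role": "user", "content": "Triage: VPN drops every 10 minutes for the finance team."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "ticket", "schema": ticket_schema},
    },
)
print(resp.choices[0].message.content)
```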
Pre-training included substantial synthetic math and science data. Reinforcement learning across 10+ environments — including formal reasoning — drove the AIME 2025 benchmark results.