

Nemotron 3 Ultra: a reasoning and orchestration model from NVIDIA, built on a hybrid Transformer-Mamba Mixture-of-Experts architecture with 550B parameters and a 1M-token context window.
NVIDIA Nemotron 3 Ultra is a next-generation reasoning and orchestration model based on a hybrid Transformer-Mamba Mixture-of-Experts architecture with 550 billion total parameters, of which 55 billion are active per forward pass. Designed for enterprise-grade agentic workflows, it targets complex reasoning tasks, multi-step analysis, and long-document understanding, with support for contexts of up to 1 million tokens.
Technical Specifications
- 550B total parameters, 55B active per forward pass (MoE architecture).
- Context window: up to 1,000,000 tokens.
Performance Benchmarks
- Strong performance on reasoning, coding, and instruction-following benchmarks.
- Optimized for multi-step and chain-of-thought reasoning scenarios.
- Suitable for both interactive chat and batch inference workloads.
Architecture Breakdown
Nemotron 3 Ultra uses a hybrid Transformer-Mamba architecture combined with Mixture-of-Experts (MoE) routing. The MoE design activates only 55B of the 550B parameters (10%) for each token, so per-token compute tracks the active parameters rather than the full model size. The Mamba layers provide sequence modeling whose cost grows linearly with sequence length, which suits long-context tasks, while the Transformer attention layers handle precise reasoning over complex inputs.
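To make the sparse-activation idea concrete, here is a minimal sketch of top-k expert routing in plain Python. It is a generic illustration of MoE gating, not NVIDIA's implementation: the number of experts, the value of `k`, and the function names `active_fraction` and `top_k_route` are all assumptions chosen for the example, since the page does not state how routing is configured.

```python
import math

# Figures taken from the specification above (in billions of parameters).
TOTAL_PARAMS_B = 550
ACTIVE_PARAMS_B = 55


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of parameters used per token under sparse activation."""
    return active_b / total_b


def top_k_route(gate_logits: list[float], k: int) -> dict[int, float]:
    """Select the k experts with the highest gate logits.

    Returns a mapping of expert index -> mixing weight, where the weights
    are a softmax over the selected logits only and therefore sum to 1.
    Hypothetical helper: the real router's details are not public here.
    """
    if not 1 <= k <= len(gate_logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest logits (ties keep the lower index first).
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    # Subtract the max before exponentiating for numerical stability.
    peak = max(gate_logits[i] for i in top)
    exps = {i: math.exp(gate_logits[i] - peak) for i in top}
    norm = sum(exps.values())
    return {i: e / norm for i, e in exps.items()}


if __name__ == "__main__":
    print(active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B))
    # Toy example: 4 experts, route each token to 2 of them (assumed values).
    print(top_k_route([0.1, 2.0, -1.0, 2.0], k=2))
```

Only the selected experts run for a given token, which is why the compute per token follows the 55B active parameters rather than the 550B total.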
API Pricing
- Input: $0.65 / 1M tokens
- Output: $3.25 / 1M tokens
- Cache Read: $0.195 / 1M tokens
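The rates above translate into a per-request cost as follows. This is a rough estimator under two assumptions that the page does not confirm: tokens served from the cache are billed at the cache-read rate instead of the input rate, and no other fees apply. The function name `estimate_cost_usd` is made up for the example.

```python
# Listed rates in USD per 1,000,000 tokens (from the pricing list above).
INPUT_RATE = 0.65
OUTPUT_RATE = 3.25
CACHE_READ_RATE = 0.195

TOKENS_PER_UNIT = 1_000_000


def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      cached_tokens: int = 0) -> float:
    """Estimate the cost of one request in USD.

    input_tokens  -- prompt tokens NOT served from cache
    output_tokens -- generated tokens
    cached_tokens -- prompt tokens served from cache (assumed to be billed
                     at the cache-read rate only)
    """
    if min(input_tokens, output_tokens, cached_tokens) < 0:
        raise ValueError("token counts must be non-negative")
    return (input_tokens * INPUT_RATE
            + output_tokens * OUTPUT_RATE
            + cached_tokens * CACHE_READ_RATE) / TOKENS_PER_UNIT


if __name__ == "__main__":
    # 200k fresh prompt tokens, 50k cached prompt tokens, 10k output tokens.
    cost = estimate_cost_usd(200_000, 10_000, cached_tokens=50_000)
    print(f"${cost:.5f}")
```

For the request in the example, the three terms are $0.13 for input, $0.0325 for output, and $0.00975 for cache reads, roughly $0.172 in total. Output tokens cost five times as much as input tokens at these rates, so long generations dominate the bill faster than long prompts do.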
Core Features & Capabilities
- Extended Context: Supports a context of up to 1M tokens for long-document analysis and retrieval.
- Complex Reasoning: Optimized for chain-of-thought, multi-step problem solving, and logical inference.
- Tool Use: Supports function calling for agentic and orchestration workflows.
- Agent Orchestration: Designed as both an orchestrator and sub-agent in multi-agent pipelines.
- Instruction Following: Strong performance on precise instruction adherence across diverse tasks.
- Code Generation: Capable of generating, reviewing, and debugging complex code across languages.
- Long-Context Summarization: Processes and summarizes large documents, codebases, and transcripts.
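The tool-use and orchestration items above can be sketched as a small dispatch loop on the application side. The page does not document the request or response format, so everything here is an assumption: the JSON-Schema-style tool description, the `{"name": ..., "arguments": ...}` shape of a tool call, and the names `get_word_count`, `TOOL_SPECS`, and `dispatch` are hypothetical and only illustrate the general pattern.

```python
import json


def get_word_count(text: str) -> int:
    """Example tool: count whitespace-separated words."""
    return len(text.split())


# Registry mapping tool names to Python callables (hypothetical).
TOOLS = {"get_word_count": get_word_count}

# Tool description the application would send alongside the prompt.
# The JSON-Schema layout is an assumed convention, not a documented one.
TOOL_SPECS = [
    {
        "name": "get_word_count",
        "description": "Count the words in a piece of text.",
        "parameters": {
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
    }
]


def dispatch(tool_call_json: str) -> str:
    """Run the tool named in a model-emitted call and return a JSON result."""
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in TOOLS:
        return json.dumps({"error": f"unknown tool: {name}"})
    result = TOOLS[name](**call.get("arguments", {}))
    return json.dumps({"name": name, "result": result})


if __name__ == "__main__":
    # A tool call as the model might emit it (assumed shape).
    call = '{"name": "get_word_count", "arguments": {"text": "a b c"}}'
    print(dispatch(call))
```

In an orchestrator setup, the result string would be appended to the conversation and sent back to the model, which then decides whether to call another tool or answer directly.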
Comparison with Other Models
VS DeepSeek R1: Nemotron 3 Ultra offers a 1M-token context window vs. DeepSeek R1's shorter context. Both target advanced reasoning, but Nemotron 3 Ultra uses a hybrid Transformer-Mamba MoE architecture optimized for enterprise deployment.
VS Claude Sonnet 4: Claude Sonnet 4 focuses on balanced performance and speed; Nemotron 3 Ultra prioritizes extended context and MoE-driven reasoning efficiency at scale.
VS GPT-4o: GPT-4o delivers multimodal capabilities; Nemotron 3 Ultra specializes in long-context text reasoning and agentic orchestration with a larger parameter footprint.