



Nemotron Nano 12B V2 VL is a 12-billion-parameter open multimodal vision-language model developed by NVIDIA. It processes text and multi-image inputs and excels at video understanding, multi-image document reasoning, and coherent natural-language response generation. Its hybrid Transformer-Mamba architecture pairs transformer-level accuracy with Mamba's memory-efficient sequence modeling, delivering high throughput and low-latency inference on long-context workloads such as lengthy documents and videos.
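The memory advantage of the Mamba-style layers comes from replacing attention's per-token key-value cache with a fixed-size recurrent state. The toy Python sketch below is illustrative only (the real model uses learned, high-dimensional state-space layers, not a scalar recurrence); it contrasts the two memory profiles as sequence length grows:

```python
# Toy comparison: memory footprint per decoding step for attention-style
# caching vs a Mamba-style recurrence. Illustrative sketch only; not the
# model's actual implementation.

def attention_cache_sizes(seq_len):
    """Attention keeps every past key/value pair: cache grows linearly."""
    cache = []
    sizes = []
    for token in range(seq_len):
        cache.append(token)       # store this token's key/value
        sizes.append(len(cache))  # memory held after each step
    return sizes

def mamba_state_sizes(seq_len, decay=0.9):
    """A linear recurrence h = decay*h + x folds each token into one
    fixed-size state, so memory per step stays constant."""
    h = 0.0
    sizes = []
    for token in range(seq_len):
        h = decay * h + float(token)  # absorb the new token into the state
        sizes.append(1)               # state size never grows
    return sizes

print(attention_cache_sizes(5))  # [1, 2, 3, 4, 5]
print(mamba_state_sizes(5))      # [1, 1, 1, 1, 1]
```

This constant-size state is what lets the hybrid design sustain long video and document contexts without the quadratic attention cost dominating latency.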
vs Qwen3 32B VL: Nemotron leads on OCR and video benchmarks, while Qwen3 emphasizes breadth across tasks. Both deliver strong results, but Nemotron's throughput-oriented design makes it better suited to latency-sensitive, real-time applications.
vs LLaVA-1.5: While LLaVA-1.5 is a competitive research model known for innovative multimodal instruction tuning, Nemotron Nano 12B V2 VL outperforms it in document intelligence, OCR, and extended video reasoning by incorporating dedicated vision encoders and efficient video sampling techniques.
vs Eagle 2.5: Eagle 2.5 is strong in general visual question answering, but Nemotron offers more specialized capabilities in chart reasoning, document understanding, and video comprehension.
vs InternVL 14B V2: Nemotron’s hybrid Mamba-Transformer backbone achieves greater throughput on long-context tasks, making it more suitable for real-time AI agents processing dense visual and text data.