Nemotron Nano 12B V2 VL

Optimized for low-latency deployment, it excels in optical character recognition (OCR), chart reasoning, document comprehension, and long-form video analysis.

AI Playground

Test any API model in the sandbox environment before you integrate. More than 200 models are available to build into your app.

Nemotron Nano 12B V2 VL

Nano 12B V2 VL is a 12-billion-parameter open multimodal reasoning model engineered for vision-language inference, processing text and multi-image inputs to generate coherent natural-language responses.

Nemotron Nano 12B V2 VL API Overview

Nemotron Nano 12B V2 VL is an advanced 12-billion-parameter open multimodal vision-language model developed by NVIDIA. It excels in video understanding, multi-image document reasoning, and natural language output generation. Harnessing a novel hybrid Transformer-Mamba architecture, it balances transformer-level accuracy with Mamba’s memory-efficient sequence modeling. This enables fast throughput and low-latency inference, optimized for complex tasks involving text and images, especially long documents and videos.

Technical Specifications

  • Model Size: 12.6 billion parameters
  • Architecture: Hybrid Transformer-Mamba sequence model
  • Context Window: Up to 128,000 tokens
  • Input Modalities: Text, multi-image documents, video frames

Performance Benchmarks

  • OCRBench v2: Leading accuracy in optical character recognition for document understanding
  • Multimodal Reasoning: Average score ≈74 across MMMU, MathVista, AI2D, ChartQA, DocVQA, Video-MME
  • Video Comprehension: Efficient Video Sampling (EVS) enables long-form video processing with reduced inference cost
  • Multilingual Accuracy: Strong results across multiple languages with robust visual question answering and document parsing
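The Efficient Video Sampling bullet above comes down to processing far fewer frames (and therefore fewer visual tokens) than the raw video contains. The sketch below is not NVIDIA's EVS algorithm — a real sampler is content-aware — just a minimal uniform-subsampling stand-in that illustrates the token-reduction idea:

```python
def subsample_frames(num_frames: int, budget: int) -> list[int]:
    """Pick at most `budget` evenly spaced frame indices from a video.

    A content-aware sampler like EVS chooses frames more cleverly; this
    uniform version only shows how a frame budget caps the visual tokens
    sent to the model.
    """
    if num_frames <= budget:
        return list(range(num_frames))
    step = num_frames / budget
    return [int(i * step) for i in range(budget)]

# A one-hour 30 fps video (108,000 frames) reduced to a 64-frame budget:
print(subsample_frames(108_000, 64)[:4])  # → [0, 1687, 3375, 5062]
```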

Key Features

  • Low Latency VL Inference: Optimized for fast, high-throughput reasoning on text + images
  • Efficient Long-Context Processing: Handles lengthy videos and documents up to 128K tokens using innovative token reduction techniques
  • Multi-Image & Video Understanding: Simultaneous analysis of multiple images and video frames for comprehensive scene interpretation and summarization
  • High-Resolution & Wide Layout Support: Processes tiled images and panoramic inputs for charts, forms, and complex visual documents
  • Multimodal Querying: Supports visual question answering, document data extraction, multi-step reasoning, and dense captioning in multiple languages
  • Hybrid Transformer-Mamba Architecture: Balances the accuracy of transformers with the memory efficiency of Mamba for scalable inference

Nemotron Nano 12B V2 VL API Pricing

  • Input: $0.22155 / 1M tokens
  • Output: $0.66465 / 1M tokens
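As a quick sanity check on those rates, per-request cost is just tokens × rate. The snippet below uses the prices listed above; the example token counts are made up for illustration:

```python
INPUT_RATE = 0.22155 / 1_000_000   # USD per input token (rate listed above)
OUTPUT_RATE = 0.66465 / 1_000_000  # USD per output token (rate listed above)

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request at the listed per-token rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# e.g. a 100K-token document prompt with a 2K-token answer:
print(f"${estimate_cost(100_000, 2_000):.4f}")  # → $0.0235
```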

Code Sample
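A minimal request sketch, assuming an OpenAI-style chat-completions interface for image + text input (the shape most model gateways accept). The endpoint URL and the exact model identifier below are placeholders, and the HTTP call itself is left commented out — check the provider's API reference for the real values:

```python
import json

API_URL = "https://api.example.com/v1/chat/completions"  # placeholder endpoint
MODEL_ID = "nvidia/nemotron-nano-12b-v2-vl"              # assumed model id

def build_ocr_request(image_url: str, question: str) -> dict:
    """Build an OpenAI-style chat payload with one image and a text prompt."""
    return {
        "model": MODEL_ID,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "max_tokens": 512,
    }

payload = build_ocr_request(
    "https://example.com/invoice.png",
    "Extract all line items and totals from this invoice.",
)
print(json.dumps(payload, indent=2))

# To actually send it (requires your API key and the provider's real endpoint):
# import os, urllib.request
# req = urllib.request.Request(
#     API_URL,
#     data=json.dumps(payload).encode(),
#     headers={
#         "Authorization": f"Bearer {os.environ['API_KEY']}",
#         "Content-Type": "application/json",
#     },
# )
# print(urllib.request.urlopen(req).read().decode())
```

For video or multi-image queries, the same `content` list simply carries additional `image_url` entries (e.g. sampled video frames) alongside the text prompt.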

Comparison with Other Models

vs Qwen3 32B VL: Nemotron excels in OCR and video benchmarks, whereas Qwen3 prioritizes versatility across tasks. Both deliver strong performance, but Nemotron is optimized for real-time applications.

vs LLaVA-1.5: While LLaVA-1.5 is a competitive research model known for innovative multimodal instruction tuning, Nemotron Nano 12B V2 VL outperforms it in document intelligence, OCR, and extended video reasoning by incorporating dedicated vision encoders and efficient video sampling techniques.

vs Eagle 2.5: Eagle 2.5 is strong in general visual question answering, but Nemotron offers more specialized capabilities in chart reasoning, document understanding, and video comprehension.

vs InternVL 14B V2: Nemotron’s hybrid Mamba-Transformer backbone achieves greater throughput on long-context tasks, making it more suitable for real-time AI agents processing dense visual and text data.
