
MiMo-V2.5 is Xiaomi's native omnimodal AI, trained from scratch to understand text, images, audio, and video together.
Released on April 22, 2026 by Xiaomi's AI team, MiMo-V2.5 picks up where MiMo-V2-Omni left off, with substantially better agentic performance, sharper visual reasoning, and native support for a 1 million token context window. It's designed for production — not benchmarks alone.
The model sits at the base of the V2.5 family. For teams that need the absolute ceiling on agentic tasks and software engineering, there's MiMo-V2.5-Pro. For general-purpose multimodal work at an honest price, MiMo-V2.5 is the practical starting point.
MiMo-V2.5 API Pricing:
Not a text model with adapters — all four modalities trained jointly from day one, so cross-modal reasoning is genuinely coherent.
MiMo-V2.5 beats Xiaomi's previous text-only flagship on general agentic benchmarks, at half the token cost.
No more chunking. Process entire research papers, long video transcripts, or sprawling codebases without losing coherence.
Works natively with Claude Code, OpenCode, Kilo, and other popular scaffolds. Drop it in without rewiring your stack.
Strong multimodal results with significantly fewer tokens than closed-source competitors, which keeps production costs predictable.
A 1M-token context window and native four-modality input open up workflows that simply weren't practical before. These are some of the patterns that map most directly to real production use.
Feed in entire PDFs — financials, legal contracts, research papers — alongside embedded charts and tables, and get structured extractions or cross-document analysis in a single call. The 1M context means you don't have to split documents into chunks and reassemble the answers.
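Here's a minimal sketch of what that single-call pattern might look like over an OpenAI-compatible endpoint such as OpenRouter. The model slug, file names, and prompt are illustrative assumptions, not confirmed values.

```python
# Hypothetical single-call document + chart extraction via an
# OpenAI-compatible endpoint (e.g. OpenRouter).
import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",
)

# Full report text (already extracted from the PDF) plus an embedded
# chart sent as a base64 image. With a 1M-token window the whole
# document fits in one request instead of chunked calls.
report_text = open("annual_report.txt").read()          # illustrative file
chart_b64 = base64.b64encode(open("revenue_chart.png", "rb").read()).decode()

response = client.chat.completions.create(
    model="xiaomi/mimo-v2.5",  # assumed slug, check the provider listing
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Extract revenue by segment as JSON and reconcile "
                     "the figures against the attached chart:\n\n" + report_text},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{chart_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```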
Long-form video is finally tractable. Scene-level summarisation, entity tracking across a full-length interview, automated highlight reels from sports footage, meeting transcripts with action items — the temporal reasoning on Video-MME (87.7) translates directly into coherent long-video understanding.
Drop in product photography and get SEO-ready descriptions, attribute tags, and category suggestions at 1x token pricing. The cost delta versus premium closed-source models makes it viable to run on entire catalogs — not just flagship SKUs.
MiMo-V2.5 is designed to sit inside agent loops — not just answer questions at the end of one. It integrates with Claude Code, OpenCode, and Kilo, handles tool calls reliably, and keeps a coherent plan across long multi-step tasks without derailing into hallucination.
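A compact sketch of that loop, again assuming an OpenAI-compatible endpoint: the tool schema, the get_weather helper, and the model slug are illustrative placeholders rather than part of any official integration.

```python
# Minimal agent loop: the model requests tools, we execute them locally,
# and feed results back until it produces a final answer.
import json
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def get_weather(city: str) -> str:
    return json.dumps({"city": city, "temp_c": 21})  # stubbed local tool

messages = [{"role": "user", "content": "Should I bike to work in Beijing today?"}]
while True:
    resp = client.chat.completions.create(
        model="xiaomi/mimo-v2.5",  # assumed slug
        messages=messages,
        tools=tools,
    )
    msg = resp.choices[0].message
    messages.append(msg)
    if not msg.tool_calls:          # no tool requests: final answer reached
        print(msg.content)
        break
    for call in msg.tool_calls:     # execute each requested tool call
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": get_weather(**args),
        })
```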
Scientific figures, financial charts, dense dashboards, handwritten equations — the 81.0 on CharXiv RQ reflects genuine precision. Useful anywhere that visual data needs to be understood rather than just described.
How is "native omnimodal" different from regular multimodal?
Most multimodal models process each modality through separate pipelines and combine the results. MiMo-V2.5 was trained across all four modalities simultaneously, which means it can reason about, say, the relationship between audio content and on-screen visuals in a video — something late-fusion models struggle with.
When should I choose MiMo-V2.5 over MiMo-V2.5-Pro?
For most production workloads — multimodal understanding, agentic tasks, everyday coding, document processing — MiMo-V2.5 delivers comparable output at half the cost. Upgrade to Pro when you need top-tier performance on long-horizon software engineering or the absolute ceiling on SWE-bench-class tasks.
Can I use MiMo-V2.5 inside my existing agent framework?
Yes. MiMo-V2.5 is compatible with Claude Code, OpenCode, and Kilo out of the box. It's also available via OpenRouter, making it straightforward to route into any stack that supports that interface.
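If your stack talks to the API directly rather than through an SDK, the same routing is a plain HTTP call against OpenRouter's OpenAI-compatible endpoint. The model slug below is an assumed placeholder.

```python
# Raw HTTP routing through OpenRouter's OpenAI-compatible endpoint.
import requests

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_OPENROUTER_KEY"},
    json={
        "model": "xiaomi/mimo-v2.5",  # assumed slug
        "messages": [{"role": "user", "content": "Summarise this changelog."}],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```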
What kinds of video content can it handle?
MiMo-V2.5 handles long-form video — scene tracking, temporal reasoning, visual grounding over minutes of footage. That covers meeting recordings, product demos, educational content, sports footage, and documentary-style material. The 87.7 Video-MME score reflects genuine comprehension, not just caption generation.