The Ultimate Guide to Minimax Models 2026: M2.7, Music 2.6, Hailuo Video & Advanced TTS

From a self-evolving code agent that tops every major engineering benchmark to an AI music generator redefining what sub-bass sounds like — Minimax has quietly assembled one of the most complete multimodal model families available today. This is your comprehensive guide to all of it.

What is the Minimax model family?

Minimax is a global AI foundation model company that ships across every major modality: text reasoning and agentic coding, music generation with full structural control, studio-grade text-to-speech in 40 languages, and cinematic video synthesis. Where most providers specialize, Minimax delivers a tightly integrated ecosystem: every model is designed to work alongside the others.

The architecture underpinning the text side is a sparse Mixture of Experts (MoE) design with 230B total parameters and roughly 10B active at any given inference call. That balance between scale and efficiency is what allows the family to hit frontier benchmark numbers while remaining accessible on a pay-as-you-go basis through platforms like aimlapi.com.

- Text & Chat: M2.7 · M2.7-highspeed · M2.5 (agentic, 204K context)
- Music Generation: Music 2.6 · Music 2.0 (cover, text-to-music, full songs)
- TTS & Speech: Speech-2.8-hd · Speech-2.6 (40 languages, 7 emotions)
- Hailuo Video: Hailuo 2.3 · 2.3 Fast · 02 (1080p, 24fps, text-to-video)

Text & Chat

Minimax M2.7: the self-evolving agentic powerhouse

M2.7 is the flagship text and coding model, and its benchmark profile reads like a wishlist rather than a shipping product. The key architectural bet is recursive self-improvement: the model can evaluate its own outputs, identify failure modes, and refine its problem-solving strategy within a single session. That capability, combined with a 204,800-token context window and native tool calling, makes it unusually well-suited for long-horizon software engineering tasks.
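To make the agentic setup concrete, here is a minimal sketch of assembling a chat-completions request for a long-horizon task. It assumes an OpenAI-compatible payload shape; the system prompt and field layout are illustrative, and only the model slugs are taken from this guide.

```python
# Sketch: building an M2.7 request body for an agentic coding task.
# Payload shape assumes an OpenAI-compatible chat-completions endpoint.
def build_m27_request(task, tools=None, highspeed=False):
    """Assemble a request body for a long-horizon agentic task."""
    model = ("minimax/minimax-m2-7-highspeed" if highspeed
             else "minimax/minimax-m2.7")
    body = {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "You are a software-engineering agent. "
                        "Review your own intermediate output before finalizing."},
            {"role": "user", "content": task},
        ],
        "max_tokens": 131_072,  # M2.7's documented output ceiling
    }
    if tools:
        body["tools"] = tools   # native tool calling
    return body

req = build_m27_request("Refactor the payment module and explain each step.")
```

In a real deployment the same body would be POSTed to the provider's chat-completions endpoint; the sketch stops at payload construction.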

Core specifications

- Parameters: 230B total, ~10B active (sparse MoE)
- Context window: 204,800 tokens
- Max output: 131K tokens
- Tool-call accuracy: 75.8% on multi-step agents

Benchmark performance

- SWE-Pro (code): 56.2%
- VIBE-Pro: 55.6%
- Terminal Bench: 57%
- GDPval-AA ELO: 1495
- Tool-calling accuracy: 75.8%

The GDPval-AA ELO of 1495 makes M2.7 the highest-ranked open-source-accessible model on that leaderboard as of writing. The SWE-Pro score places it firmly in frontier coding territory, competitive with models that cost significantly more per token.

What M2.7 actually does well

Beyond raw benchmark numbers, the model shows four distinct practical strengths worth knowing:

Polyglot software engineering

Handles code refactoring, bug diagnosis, and architectural planning across multiple languages in a single session — without losing context between steps.

Office document pipelines

Word, Excel, and PowerPoint editing with a reported 97% skill adherence.

Multi-agent orchestration

Purpose-built for complex environment interaction: tool calling, agentic loops, and real-time SRE incident response all benefit from the model's high-accuracy function calling.

Recursive self-improvement

The model evaluates its own intermediate outputs and adjusts strategy mid-task — a meaningful edge for iterative debugging or multi-step data analysis.

M2.7 vs M2.7-highspeed: choosing the right variant

Both variants deliver identical output quality. The difference is throughput. Standard M2.7 runs at approximately 60 tokens per second; the highspeed variant reaches around 100 TPS, roughly 1.7× faster in practice. For latency-sensitive workloads like streaming chat interfaces or real-time incident response, the premium is usually worth it. For batch processing or background agents, standard gives you more budget headroom.

| Variant | Speed (TPS) | Intelligence | Input price | Output price | Best for |
|---|---|---|---|---|---|
| M2.7 Standard | ~60 TPS | Full | $0.39/M | $1.56/M | Complex reasoning, long-context agentic tasks |
| M2.7 Highspeed | ~100 TPS (~1.7× faster) | Full | $0.78/M | $3.12/M | Real-time chat, live SRE response, high-throughput pipelines |
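The pricing trade-off is easy to estimate up front. The sketch below uses the per-million-token prices quoted in the table above; verify current rates before budgeting, since pricing can change.

```python
# Rough cost estimate from the per-million-token prices quoted in this guide.
PRICES = {
    "m2.7":           {"input": 0.39, "output": 1.56},   # $ per 1M tokens
    "m2.7-highspeed": {"input": 0.78, "output": 3.12},
}

def job_cost(variant, input_tokens, output_tokens):
    """Return estimated USD cost for one job."""
    p = PRICES[variant]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A 50K-in / 10K-out agentic run on each variant:
standard = job_cost("m2.7", 50_000, 10_000)
fast = job_cost("m2.7-highspeed", 50_000, 10_000)  # exactly 2x the standard cost
```

At these rates the highspeed variant costs exactly double per token, which is why batch jobs that nobody is waiting on belong on standard.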

Music Generation

Minimax Music 2.6: cover reborn, bass redefined

Released into global beta on April 10, 2026, Music 2.6 is the most significant update to Minimax's audio generation stack since the original model launched. It ships two new capabilities — cover generation and enhanced low-frequency reproduction — alongside meaningful improvements to structural control, vocal realism, and instrument nuance. If you've been watching the AI music space closely, this is the model that changes the conversation.

What's actually new in 2.6

Cover generation (upload any song → extract melodic skeleton → restyle completely) and enhanced sub-bass that translates clearly on any playback device from studio monitors to phone speakers.

Feature breakdown

Cover generation

Upload an existing song. The model extracts its melodic skeleton and rebuilds it in any style, arrangement, or language you specify. Think a personal "Auld Lang Syne" remake, finished in under 30 minutes.

Structural tag control

Use [Verse], [Chorus], [Bridge] tags alongside BPM, key signature, and emotional arc parameters to direct the composition's shape precisely.

Intent-aware composition

Describe tension, buildup, and release in your prompt; the model responds to dramatic intent, not just genre labels. Useful for adaptive game audio and cinematic scoring.

Enhanced bass & low frequencies

Tight sub-bass and punchy drum transients that hold up on small speakers. Production-ready at 44.1kHz / 256kbps with first audio packet under 20 seconds.

Natural vocal imprecision

Deliberate micro-timing variation and subtle pitch drift for lo-fi, indie, and jazz styles, moving away from the uncanny perfection of earlier AI voices.

Instrument nuance

Vibrato, breath sounds, and dynamic variation on traditional instruments including erhu, dizi, and guzheng — the cultural authenticity gap narrows significantly here.
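The structural controls above can be combined in a single prompt. Here is an illustrative sketch of flattening section tags, BPM, key, and an emotional arc into one prompt string; the exact field names and prompt grammar are assumptions, not the official Music 2.6 schema.

```python
# Illustrative only: composing a structured Music 2.6 prompt from the
# controls this article describes ([Verse]/[Chorus]/[Bridge], BPM, key, arc).
def build_music_prompt(bpm, key, arc, lyrics_by_section):
    """Flatten section tags plus global parameters into one prompt string."""
    header = f"bpm: {bpm} | key: {key} | emotional arc: {arc}"
    sections = "\n".join(f"[{tag}]\n{text}"
                         for tag, text in lyrics_by_section.items())
    return f"{header}\n{sections}"

prompt = build_music_prompt(
    92, "A minor", "oppressive buildup into euphoric release",
    {"Verse": "Streetlights hum over empty roads",
     "Chorus": "We run until the morning finds us",
     "Bridge": "Hold the silence, let it break"},
)
```

The point of the structure is that each tag steers a different section of the arrangement, so the model is not guessing where your chorus starts.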

Audio output quality

- Sample rate: 44.1kHz (production standard)
- Bitrate: 256kbps (release quality)
- First packet latency: <20s to first audio
- Max track length: 5 min per generation

Music 2.6 vs previous versions — what changed

| Capability | Music 2.0 | Music 2.6 | Improvement |
|---|---|---|---|
| Cover generation | Not available | Full melodic skeleton extraction | New |
| Bass reproduction | Muddy on small speakers | Tight sub-bass, device-agnostic | Significant |
| Structure control | Genre & mood tags only | [Verse]/[Chorus]/BPM/key/arc | Significant |
| First packet latency | ~40s | <20s | 2× faster |
| Output quality | 192kbps / 44.1kHz | 256kbps / 44.1kHz | Higher bitrate |
| Vocal naturalness | Smooth / clinical | Micro-imprecision for organic feel | Improved |

TTS & Speech

Minimax TTS: 40 languages, 7 emotions, ultra-realistic voice cloning

The Speech-2.x series covers everything from real-time low-latency narration to high-fidelity studio-quality voice output. The key differentiator versus most commercial TTS systems is emotional range: seven distinct emotional registers (neutral, happy, sad, angry, fearful, disgusted, surprised) that can be blended and controlled per-sentence, not just per-request.

Voice cloning requires minimal reference audio — a 10-second sample is typically enough for reasonable timbre replication. Tonal nuance is preserved across languages, which matters significantly for tonal languages like Mandarin and Vietnamese where flat TTS systems introduce meaning errors.
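Per-sentence emotion control can be sketched as a request builder that validates each segment against the seven registers. The `segments`/`emotion` payload shape here is an illustrative assumption; check the provider's TTS API reference for the real schema.

```python
# Sketch of per-sentence emotion control for the Speech-2.x series.
# The payload shape is assumed, not taken from official docs.
EMOTIONS = {"neutral", "happy", "sad", "angry",
            "fearful", "disgusted", "surprised"}

def build_tts_request(model, segments):
    """segments: (sentence, emotion) pairs, checked against the 7 registers."""
    for _, emotion in segments:
        if emotion not in EMOTIONS:
            raise ValueError(f"unknown emotion: {emotion}")
    return {
        "model": model,
        "segments": [{"text": t, "emotion": e} for t, e in segments],
    }

req = build_tts_request("speech-2.8-hd", [
    ("Welcome back.", "happy"),
    ("I'm afraid there's bad news.", "sad"),
])
```

Validating emotions client-side keeps a typo from silently falling back to a neutral read mid-narration.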

| Model | Description | Best For |
|---|---|---|
| Speech 2.8 HD | Max fidelity, voice cloning, broadcast-quality output | High-end audio production |
| Speech 2.8 Turbo | Low-latency real-time use, streaming support | Real-time applications & streaming |
| Speech 2.6 HD | Previous generation, production-stable, cost-efficient | Stable production workloads |
| Speech 2.6 Turbo | Fast variant, ideal for chatbots and live agents | Chatbots & live agents |

Video

Minimax Hailuo: text-to-video and image-to-video at 1080p 24fps

The Hailuo family handles the visual end of the Minimax stack. Three variants cover different speed/quality trade-offs, all generating clips at 1080p and 24fps — a meaningful bar for short-form content where sharpness and motion smoothness matter far more than raw resolution numbers.

The standout technical capability is physics adherence: Hailuo's instruction-following on physical dynamics (rigid body interactions, fluid motion, soft materials) is among the best currently available from any API-accessible model. Facial emotion rendering is similarly strong — the model understands the difference between a tight smile and a genuine one.

| Model | Description | Best For |
|---|---|---|
| Hailuo 2.3 | Highest quality · SOTA physics · full instruction following | High-end video generation |
| Hailuo 2.3 Fast | Faster generation · lower cost · great for iteration | Rapid prototyping & iteration |
| Hailuo 02 | Image-to-video · consistent subjects · marketing-ready | Marketing & visual content |

All variants produce 6–10 second clips, which aligns naturally with social content, product demos, and the transition segments that hold longer videos together. For projects requiring full scenes, clips can be chained with consistent character and environment descriptions.
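The chaining trick is simply to restate the same character and environment description in every shot prompt. A minimal sketch, with the character and setting strings being made-up examples:

```python
# Sketch: chaining 6-10 s Hailuo clips into a longer scene by repeating a
# fixed character/environment description per shot. Prompt construction only;
# the generation call itself is omitted since the API shape isn't shown here.
CHARACTER = "a woman in a red raincoat, short black hair"
SETTING = "rain-slicked neon street at night, 1080p, 24fps"

def shot_prompts(actions):
    """One prompt per clip, each restating character + setting for consistency."""
    return [f"{CHARACTER}, {SETTING}. Shot {i + 1}: {action}"
            for i, action in enumerate(actions)]

prompts = shot_prompts([
    "she opens an umbrella and steps off the curb",
    "close-up as she checks her phone under the awning",
    "wide shot as she walks away into the crowd",
])
```

Keeping the description byte-identical across shots is what holds the subject's appearance stable from clip to clip.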

Full family comparison: which model for which job

| Model | Modality | Context / length | Speed | Primary use case |
|---|---|---|---|---|
| M2.7 | Text | 204K tokens | ~60 TPS | Complex reasoning, agentic coding |
| M2.7 Highspeed | Text | 204K tokens | ~100 TPS | Real-time, high-throughput agents |
| Music 2.6 | Audio / Music | Up to 5 min | <20s first packet | Game OST, covers, playlists, BGM |
| Speech 2.8 HD | TTS / Voice | Per request | Standard | Broadcast narration, voice cloning |
| Speech 2.8 Turbo | TTS / Voice | Per request | Ultra-low latency | Live agents, real-time chat voice |
| Hailuo 02 | Video | 6–10 sec clips | Standard | Marketing, social content, prototypes |
| Hailuo 2.3 Fast | Video | 6–10 sec clips | Fast | Rapid iteration, draft production |

The full-stack creative pipeline

Every model in the family is designed to feed into the next. A complete content production workflow using only Minimax models looks like this:

M2.7: script & copy → Music 2.6: soundtrack → Speech 2.8: narration → Hailuo 02: video
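The four-stage pipeline above can be sketched as plain function composition. Each stage is a stub standing in for one API call to the named model; the function names and return shapes are invented for illustration.

```python
# Minimal orchestration sketch of the script -> music -> voice -> video
# pipeline. Every stage is a stub for a single model call.
def m27_script(brief):            # M2.7: script & copy
    return f"SCRIPT for: {brief}"

def music26_soundtrack(script):   # Music 2.6: soundtrack
    return f"TRACK matching: {script}"

def speech28_narration(script):   # Speech 2.8: narration
    return f"VOICEOVER of: {script}"

def hailuo02_video(script, track, voice):  # Hailuo 02: video
    return {"video": script, "audio": [track, voice]}

def produce(brief):
    """Run the full pipeline: the script drives every downstream stage."""
    script = m27_script(brief)
    return hailuo02_video(script,
                          music26_soundtrack(script),
                          speech28_narration(script))

asset = produce("30-second product teaser")
```

The key design point is that the M2.7 script is the single source of truth feeding all three downstream stages, which keeps soundtrack, narration, and visuals describing the same content.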

Frequently asked questions

What makes M2.7 different from M2.5 - is the upgrade worth it?

The short answer: yes, especially for agentic and coding workloads. M2.7 introduces recursive self-improvement: the model can evaluate its own intermediate outputs within a session and adjust its strategy without human re-prompting. That's a meaningful capability leap for long-horizon tasks like multi-file refactors or data analysis pipelines.

On benchmarks, M2.7 scores 56.2% on SWE-Pro and holds the top ELO ranking (1495) among open-source-accessible models on GDPval-AA. M2.5 is still a solid choice for simpler chat and summarization tasks where the self-evolution overhead isn't needed, but for anything involving tool calls, multi-step agents, or extended coding sessions, M2.7 is the right pick.

When should I use M2.7-highspeed vs the standard M2.7?

Both variants produce identical output quality; this is a throughput decision, not a quality one. Highspeed runs at approximately 100 tokens per second versus the standard ~60 TPS, roughly a 1.7× reduction in wall-clock latency for long responses.

Use minimax/minimax-m2-7-highspeed when: you're building a streaming chat interface, running real-time SRE incident response, or any workflow where the user is waiting on the response. Use standard minimax/minimax-m2.7 for batch jobs, background agents, or any task that runs overnight — you'll cut your token spend roughly in half with no quality trade-off.

What do the structural tags like [Verse] and [Chorus] actually control?

The structural tags in your prompt directly shape the composition's architecture. When you write [Verse], the model generates material with the characteristics of a verse: lower energy, narrative lyrics, melodic restraint. [Chorus] triggers higher energy, hook-forward melody, and fuller arrangement. [Bridge] introduces harmonic contrast and tension before resolution.

Alongside the tags, you can specify BPM (numeric), key signature, and emotional arc descriptors like "oppressive buildup" or "euphoric release." The model's intent-aware composition system responds to these dramatic cues, not just genre keywords, which is what makes it genuinely useful for adaptive game audio where mood needs to shift on cue.

Can Music 2.6 generate full songs with vocals, or just instrumentals?

Both. Music 2.6 generates complete tracks with vocals and instrumentals in a single API call. It also handles auto-lyrics generation: if you supply a style and structure without writing the lyrics yourself, the model generates them to fit the musical content and emotional arc you've described.

For instrumentals-only use cases (game BGM, podcast beds, background music), you can specify that in your prompt tags and the model will generate without vocal lines. Output quality is production-ready at 256kbps / 44.1kHz across both modes, with first audio arriving in under 20 seconds.

What's the practical difference between speech-2.8-hd and speech-2.8-turbo?

HD is optimized for fidelity — longer outputs, richer timbre, better handling of expressive speech. It's the right choice for podcast narration, audiobook production, video voiceover, and any context where quality is noticed.

Turbo is optimized for latency — it's designed to stream audio in near-real-time for conversational AI interfaces, phone agents, and live applications where the first audio byte needs to arrive within milliseconds, not seconds. The quality trade-off in turbo is audible on close listening but imperceptible in fast conversational exchanges. Use HD for recorded content, turbo for live interaction.

What makes Hailuo's physics handling stand out versus other video models?

Most text-to-video models handle physics as an afterthought: they can generate a person walking convincingly but fall apart on anything requiring real dynamic simulation, such as liquid pouring, objects colliding, cloth moving, or fire behaving correctly. Hailuo 2.3 was specifically trained on physics-heavy scenarios, which is why its instruction following on dynamic content is substantially more reliable than comparable models.

The facial emotion rendering is the other standout. The model distinguishes between subtle emotional states rather than mapping everything to exaggerated expressions, which matters for product demos, brand content, and any video where a character needs to convey a specific feeling naturally rather than theatrically.


Ready to get started? Get Your API Key Now!
