MiniMax Audio: Voices from China

The year wraps up with yet another exciting novelty: Chinese developer MiniMax, under its Hailuo AI brand, introduces "a hyper-realistic and multi-emotion Text-To-Speech model" alongside "Voice Cloning within 5 seconds". Shall we take a closer look at the MiniMax Audio Model?

Overview

The MiniMax Audio model (also referred to as Speech-01 in some documentation and press releases) is a product of the Chinese startup MiniMax, founded by former members of the SenseTime Group. MiniMax made a name for itself over the course of 2024 by building multimodal AI models, notably for music creation and video generation.

In November 2024, MiniMax consolidated its AI services into a unified platform, making tools like video generators, chatbots, and music generators easily accessible to users. Developers also received APIs to integrate these services into their own applications. Notably, MiniMax Music, an AI model that creates music from text prompts, has already attracted significant attention from the tech and creative communities.

MiniMax continues to push the boundaries of AI, drawing in substantial investments to broaden the scope of its technologies. The company's mission revolves around equipping users with innovative tools that redefine how multimedia content is created and processed.

Core capabilities

MiniMax Audio is built to handle a wide range of audio-related tasks through a single model interface.

🎙 Speech synthesis (TTS)
Convert long-form text into natural, emotionally expressive speech at scale.

🔁 Voice cloning
Clone any voice in as little as 5 seconds with high fidelity to original tone, rhythm, and accent.

🧠 Emotion-aware output
Automatically adapts tone (joy, melancholy, excitement) based on semantic context in the input text.

🔊 Audio enhancement
Reduces noise and improves audio quality across source material, useful for post-processing workflows.

📝 Speech recognition
Accurate transcription of spoken language into text, supporting multiple languages and accents.

🎛 Fine-grained voice control
Adjust pitch, speed, nasal depth, breathiness, and more using advanced synthesis parameters.

What sets MiniMax Audio apart from other TTS models

Ultra-long text synthesis — up to 10 million characters

Most TTS models cap out at around 100,000 characters per request. MiniMax Audio supports inputs of up to 10 million characters, a 100x increase that makes audiobook generation, full podcast production, and large-scale content automation practical without batching or stitching.
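To make the difference concrete, here is a minimal sketch of how many requests a long manuscript would need under each cap. The two limits are the figures quoted above; the manuscript length and the function name are hypothetical and chosen only for illustration.

```python
# Per-request character caps as quoted in the text above (not verified here).
TYPICAL_LIMIT = 100_000
MINIMAX_LIMIT = 10_000_000


def requests_needed(text_length: int, limit: int) -> int:
    """Return how many requests are needed to synthesise text_length characters."""
    if text_length < 0 or limit <= 0:
        raise ValueError("text_length must be >= 0 and limit must be > 0")
    # Integer ceiling division avoids floating-point rounding.
    return (text_length + limit - 1) // limit


manuscript_length = 650_000  # hypothetical manuscript size, for illustration only
print(requests_needed(manuscript_length, TYPICAL_LIMIT))  # 7
print(requests_needed(manuscript_length, MINIMAX_LIMIT))  # 1
```

With the smaller cap, the seven pieces would also have to be split at sensible boundaries and stitched back together, which is the overhead the larger limit is meant to remove.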

Emotional intelligence built in

The model scans input text for emotional cues and automatically adjusts delivery. Whether the script is a cheerful product announcement or a sombre documentary narration, MiniMax Audio picks up on those signals without you needing to manually insert SSML tags or emotion markers.

5-second voice cloning

Feed the model a short audio sample and it produces a cloned voice in roughly 5 seconds. This opens up real-time personalisation workflows that simply aren't possible with slower cloning pipelines.

Thousands of customisable voices

The model ships with an extensive voice library. You can choose by gender, language, accent, and tone — and blend characteristics to create entirely new voice profiles suited to your brand or product.


MiniMax also attributes the quality of generation to the way the model was trained: “While traditional Text-to-Speech AI relies on fixed pronunciation dictionaries and predefined parameters, Speech-01 is trained on millions of hours of high-quality audio data. This allows it to grasp subtle nuances like accents, speech habits, and pitch variations, resulting in more natural and contextually aware speech.”

This model is built to serve as a flexible tool for businesses and developers who require advanced solutions for voice and audio processing. Whether it's for enhancing customer experiences, automating workflows, or creating new auditory content, the MiniMax Audio model offers a robust foundation to build upon.

Language Support

For text-to-speech generation on the official website, a wide range of languages is available: English, Chinese, Japanese, Korean, Spanish, Portuguese, German, French, Indonesian, Russian, and Italian. English and Chinese are offered with multiple dialects and accents for enhanced flexibility.

However, when accessed via API, the declared language support is currently limited to English, Chinese, Japanese, and Korean.
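If you plan a project around these lists, a small lookup can flag languages that only work on the website. The sketch below simply encodes the two lists from this section; the function name and return strings are hypothetical, and the lists should be rechecked against the current documentation before relying on them.

```python
# Language lists as stated in this section (may change over time).
WEB_LANGUAGES = frozenset({
    "English", "Chinese", "Japanese", "Korean", "Spanish", "Portuguese",
    "German", "French", "Indonesian", "Russian", "Italian",
})
API_LANGUAGES = frozenset({"English", "Chinese", "Japanese", "Korean"})


def availability(language: str) -> str:
    """Report where a language is available according to the lists above."""
    name = language.strip().title()  # normalise case and surrounding spaces
    if name in API_LANGUAGES:
        return "web and API"
    if name in WEB_LANGUAGES:
        return "web only"
    return "not listed"


for lang in ("Korean", "German", "Hindi"):
    print(lang, "->", availability(lang))
```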

Use cases

Audiobooks & long-form narration

Handle full manuscripts in a single API call — no batching needed.

Customer support IVR

Deploy natural-sounding voices for automated phone and chat systems.

Localisation at scale

Convert written content into speech across the 11 languages offered on the web interface (four of them via the API) for global audiences.

Podcast & video production

Generate voiceovers and narration without booking studio time.

E-learning platforms

Create emotionally engaging educational audio from written course material.

Accessibility tools

Convert web content and documents into lifelike speech for visually impaired users.

Competitors Overview

Of course, MiniMax Audio will have to compete with many previously launched models. This market is highly competitive, with numerous TTS models developed by IT giants such as Google, Microsoft, and Amazon:

Source: artificialanalysis.ai

Unfortunately, there is currently no publicly available information about MiniMax Audio being tested in well-known benchmarks for speech synthesis quality evaluation. This might be due to MiniMax Audio being a relatively new or specialized solution that has not yet undergone independent assessments.

Availability

At present, the model's capabilities can be tested in a sandbox environment available on the official website: 

Source: hailuo.ai

The user can input their text, select one of several dozen available voices (with filters for the desired language, gender, and accent), and listen to their text in the chosen voice.

For more detailed adjustments, the panel on the right provides controls ranging from basic options like Pitch and Speed to advanced sliders along scales such as “Deepen – Lighten,” “Stronger – Softer,” and “Nasal – Crisp”:

Source: hailuo.ai

Additionally, there are options to add echo and specific effects, such as simulating a LoFi phone or a robotic voice.
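As a rough illustration of how these controls could be captured in your own code, here is a sketch of a settings object. It is not the MiniMax API: the field names, the -1.0 to 1.0 slider range, the speed check, and the effect identifiers are all assumptions made for this example, since the actual parameter names and ranges are not documented here.

```python
from dataclasses import dataclass, asdict

# Hypothetical effect identifiers mirroring the options named above.
KNOWN_EFFECTS = {"echo", "lofi_phone", "robotic"}


@dataclass
class VoiceSettings:
    """Hypothetical container for the sandbox controls; ranges are assumed."""
    pitch: float = 0.0
    speed: float = 1.0
    deepen_lighten: float = 0.0    # assumed scale: -1.0 (deepen) to 1.0 (lighten)
    stronger_softer: float = 0.0   # assumed scale: -1.0 (stronger) to 1.0 (softer)
    nasal_crisp: float = 0.0       # assumed scale: -1.0 (nasal) to 1.0 (crisp)
    effects: tuple = ()

    def __post_init__(self):
        # Validate the three two-sided sliders against the assumed range.
        for name in ("deepen_lighten", "stronger_softer", "nasal_crisp"):
            value = getattr(self, name)
            if not -1.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between -1.0 and 1.0")
        if self.speed <= 0:
            raise ValueError("speed must be positive")
        unknown = set(self.effects) - KNOWN_EFFECTS
        if unknown:
            raise ValueError(f"unknown effects: {sorted(unknown)}")


settings = VoiceSettings(speed=1.2, nasal_crisp=-0.3, effects=("echo",))
print(asdict(settings))
```

Validating values locally like this catches typos before any request is sent, whatever the real parameter names turn out to be.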

Most of the model's functions are already accessible through the API. A detailed description of all available API parameters will appear on our website soon.

Conclusions

Well, this month saw the arrival of a new TTS model on the market. The developer claims it has been trained on higher-quality data compared to most competitor models and promises impressive performance. However, until it appears on TTS benchmark websites with real comparisons, it’s hard to say for sure.

Online users have praised the lively and interesting voices on the list, but have also noted that assigning different emotional tones to the same voice makes only a barely noticeable difference.

We’ll be watching closely to see what niche MiniMax Audio can carve out in the already crowded global TTS market.

Frequently asked questions

How does the 10 million character limit work in practice?

The model accepts up to 10 million characters in a single synthesis request. A full-length novel is roughly 500,000–700,000 characters, so the limit is rarely a constraint for typical production use cases.
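The headroom is easy to check with a quick calculation. This sketch uses only the figures quoted in the answer above; the function name is hypothetical.

```python
LIMIT = 10_000_000               # characters per request, as stated above
NOVEL_RANGE = (500_000, 700_000)  # rough novel length range, as stated above


def novels_per_request(limit: int, novel_length: int) -> int:
    """Whole novels of the given length that fit into one request."""
    return limit // novel_length


shortest, longest = NOVEL_RANGE
print(novels_per_request(LIMIT, longest))   # 14
print(novels_per_request(LIMIT, shortest))  # 20
```

In other words, a single request has room for roughly 14 to 20 novels of typical length.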

Can I clone a voice using the API?

Yes. Voice cloning is available via the API. You provide a short audio sample and the model generates a cloned voice profile in approximately 5 seconds, which you can then use for subsequent synthesis requests.

What languages work through the API?

The API currently supports English, Chinese, Japanese, and Korean. The web interface extends this to 11 languages. API language support is expected to expand as the model matures.

Are there published benchmarks for MiniMax Audio?

Not yet. MiniMax Audio hasn't appeared in major independent TTS benchmarks as of early 2025. That said, early user feedback highlights its lively voice selection, while also noting that the difference between emotional tones applied to the same voice is barely noticeable.
