MiniMax Audio: Voices from China
Overview
The MiniMax Audio model (also referred to as Speech-01 in some documentation and press releases) is a product of the Chinese startup MiniMax, founded by former members of the SenseTime Group. Over the course of 2024, MiniMax made a name for itself with multimodal artificial intelligence models, notably in music creation and video generation.
In November 2024, MiniMax consolidated its AI services into a unified platform, making tools like video generators, chatbots, and music-generation models easily accessible to users. Developers were also given APIs to integrate these services into their own applications. Notably, MiniMax Music, an AI-powered model that creates music from text prompts, has already attracted significant attention from the tech and creative communities.
MiniMax continues to push the boundaries of AI, drawing in substantial investments to broaden the scope of its technologies. The company's mission revolves around equipping users with innovative tools that redefine how multimedia content is created and processed.
Core capabilities
MiniMax Audio is built to handle a wide range of audio-related tasks through a single model interface.
What sets MiniMax Audio apart from other TTS models
Ultra-long text synthesis — up to 10 million characters
Most TTS models cap out at around 100,000 characters per request. MiniMax Audio supports inputs up to 10 million characters — a 100x improvement that makes it practical for audiobook generation, full podcast production, and large-scale content automation without batching or stitching.
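To put the two limits in perspective, the short sketch below counts how many requests a text of a given length would need under each per-request cap. The function name and the sample manuscript length are illustrative assumptions; the two caps are simply the figures quoted above, not values read from any API.

```python
TYPICAL_CAP = 100_000       # per-request cap quoted above for most TTS models
MINIMAX_CAP = 10_000_000    # per-request cap quoted above for MiniMax Audio


def requests_needed(text_length: int, cap: int) -> int:
    """Return how many requests are needed to synthesise text_length characters."""
    if text_length < 0 or cap <= 0:
        raise ValueError("text_length must be >= 0 and cap must be > 0")
    # Integer ceiling division: a partially filled final request still counts.
    return (text_length + cap - 1) // cap


manuscript = 650_000  # hypothetical manuscript length in characters
print(requests_needed(manuscript, TYPICAL_CAP))  # 7
print(requests_needed(manuscript, MINIMAX_CAP))  # 1
```

Under these assumptions, a manuscript that would need seven stitched requests at a 100,000-character cap fits into a single request at 10 million characters.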
Emotional intelligence built in
The model scans input text for emotional cues and automatically adjusts delivery. Whether the script is a cheerful product announcement or a sombre documentary narration, MiniMax Audio picks up on those signals without you needing to manually insert SSML tags or emotion markers.
5-second voice cloning
Feed the model a short audio sample and it produces a cloned voice in roughly 5 seconds. This opens up real-time personalisation workflows that simply aren't possible with slower cloning pipelines.
Thousands of customisable voices
The model ships with an extensive voice library. You can choose by gender, language, accent, and tone — and blend characteristics to create entirely new voice profiles suited to your brand or product.
The manufacturer also claims that the quality of generation is based on the unique training characteristics of the model: “While traditional Text-to-Speech AI relies on fixed pronunciation dictionaries and predefined parameters, Speech-01 is trained on millions of hours of high-quality audio data. This allows it to grasp subtle nuances like accents, speech habits, and pitch variations, resulting in more natural and contextually aware speech.”
This model is built to serve as a flexible tool for businesses and developers who require advanced solutions for voice and audio processing. Whether it's for enhancing customer experiences, automating workflows, or creating new auditory content, the MiniMax Audio model offers a robust foundation to build upon.
Language Support
For text-to-speech generation on the official website, a wide range of languages is available: English, Chinese, Japanese, Korean, Spanish, Portuguese, German, French, Indonesian, Russian, and Italian. English and Chinese are offered with multiple dialects and accents for enhanced flexibility.
However, when accessed via API, the declared language support is currently limited to English, Chinese, Japanese, and Korean.
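Because the website and the API cover different languages, an application may want to check availability before submitting text. The following is a minimal sketch of such a check; the language sets are copied from the lists above, while the function name and the "web"/"api" labels are assumptions made for illustration.

```python
from typing import List

# Languages listed above for the official website.
WEB_LANGUAGES = {
    "English", "Chinese", "Japanese", "Korean", "Spanish", "Portuguese",
    "German", "French", "Indonesian", "Russian", "Italian",
}

# Languages declared above as supported via the API.
API_LANGUAGES = {"English", "Chinese", "Japanese", "Korean"}


def available_channels(language: str) -> List[str]:
    """Return the access channels ("web", "api") that list the given language."""
    channels = []
    if language in WEB_LANGUAGES:
        channels.append("web")
    if language in API_LANGUAGES:
        channels.append("api")
    return channels


print(available_channels("Korean"))   # ['web', 'api']
print(available_channels("German"))   # ['web']
print(available_channels("Hindi"))    # []
```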
Use cases
Audiobooks & long-form narration
Handle full manuscripts in a single API call — no batching needed.
Customer support IVR
Deploy natural-sounding voices for automated phone and chat systems.
Localisation at scale
Convert written content into speech across 11 languages for global audiences.
Podcast & video production
Generate voiceovers and narration without booking studio time.
E-learning platforms
Create emotionally engaging educational audio from written course material.
Accessibility tools
Convert web content and documents into lifelike speech for visually impaired users.
Competitors Overview
Of course, MiniMax Audio will have to compete with many previously launched models. This market is highly competitive, with numerous TTS models developed by IT giants such as Google, Microsoft, and Amazon.

Unfortunately, there is currently no publicly available information about MiniMax Audio being tested in well-known benchmarks for speech synthesis quality evaluation. This might be due to MiniMax Audio being a relatively new or specialized solution that has not yet undergone independent assessments.
Availability
At present, the model's capabilities can be tested in a sandbox environment available on the official website.

The user can input their text, select one of several dozen available voices (with filters for the desired language, gender, and accent), and listen to their text in the chosen voice.
For more detailed adjustments, the panel on the right provides controls ranging from basic options like Pitch and Speed to advanced sliders along scales such as “Deepen – Lighten,” “Stronger – Softer,” and “Nasal – Crisp.”

Additionally, there are options to add echo and specific effects, such as simulating a LoFi phone or a robotic voice.
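The sandbox controls described above could be modelled in application code roughly as follows. This is only a sketch: every field name, default, and numeric range here is a hypothetical choice made for illustration and is not taken from the MiniMax API, whose parameters are not described in this article.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class VoiceSettings:
    """Hypothetical container mirroring the sandbox controls (not the real API schema)."""
    pitch: float = 0.0            # assumed range: -1.0 .. 1.0
    speed: float = 1.0            # assumed multiplier: 0.5 .. 2.0
    deepen_lighten: float = 0.0   # assumed: -1.0 = deepen, 1.0 = lighten
    stronger_softer: float = 0.0  # assumed: -1.0 = stronger, 1.0 = softer
    nasal_crisp: float = 0.0      # assumed: -1.0 = nasal, 1.0 = crisp
    effects: List[str] = field(default_factory=list)  # e.g. "echo"

    def validate(self) -> None:
        """Raise ValueError if any value falls outside its assumed range."""
        for name in ("pitch", "deepen_lighten", "stronger_softer", "nasal_crisp"):
            value = getattr(self, name)
            if not -1.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between -1.0 and 1.0, got {value}")
        if not 0.5 <= self.speed <= 2.0:
            raise ValueError(f"speed must be between 0.5 and 2.0, got {self.speed}")


settings = VoiceSettings(speed=1.1, deepen_lighten=-0.3, effects=["echo"])
settings.validate()
print(asdict(settings))
```

Keeping the settings in one validated object makes it easy to serialise them with `asdict` once the actual parameter names are known.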
Most of the model's functions are already accessible through the API. As always, a detailed description of all available API parameters will appear on our website in the very near future.
Conclusions
Well, this month saw the arrival of a new TTS model on the market. The developer claims it has been trained on higher-quality data compared to most competitor models and promises impressive performance. However, until it appears on TTS benchmark websites with real comparisons, it’s hard to say for sure.
Online users have highlighted the lively and interesting voices available on the list, as well as a barely noticeable difference when assigning different emotional tones to the same voice.
We’ll be watching closely to see what niche MiniMax Audio can carve out in the already crowded global TTS market.
Frequently asked questions
How does the 10 million character limit work in practice?
The model accepts up to 10 million characters in a single synthesis request. A full-length novel is roughly 500,000–700,000 characters, so a single request could hold roughly 14 to 20 novels of that length, and the limit is rarely a constraint for typical production use cases.
Can I clone a voice using the API?
Yes. Voice cloning is available via the API. You provide a short audio sample and the model generates a cloned voice profile in approximately 5 seconds, which you can then use for subsequent synthesis requests.
What languages work through the API?
The API currently supports English, Chinese, Japanese, and Korean. The web interface extends this to 11 languages. API language support is expected to expand as the model matures.
Are there published benchmarks for MiniMax Audio?
Not yet. MiniMax Audio hasn't appeared in major independent TTS benchmarks as of early 2025. That said, early user feedback highlights its lively voice selection, although users also report only a barely noticeable difference when different emotional tones are assigned to the same voice.