News
December 25, 2024

MiniMax Audio: Voices from China

The year wraps up with yet another exciting novelty — Chinese developer Hailuo AI introduces "a hyper-realistic and multi-emotion Text-To-Speech model" alongside "Voice Cloning within 5 seconds". Shall we take a closer look at the MiniMax Audio Model?

Overview

The MiniMax Audio model (also referred to as Speech-01 in some documentation and press releases) is a product of the Chinese startup MiniMax, founded by former members of the SenseTime Group. Over the course of 2024, MiniMax has made a name for itself by building multimodal artificial intelligence models, notably for music creation and video generation.

In November 2024, MiniMax consolidated its AI services into a unified platform, making tools like video generators, chatbots, and music neural networks easily accessible to users. Developers were also provided with APIs to seamlessly integrate these services into their own applications. Notably, MiniMax Music — an AI-powered model capable of crafting music based on textual prompts — has already garnered significant attention from the tech and creative communities.

MiniMax continues to push the boundaries of AI, drawing in substantial investments to broaden the scope of its technologies. The company's mission revolves around equipping users with innovative tools that redefine how multimedia content is created and processed.

What Makes the MiniMax Audio Model Stand Out?

Unlike the previously launched MiniMax Music tool, the MiniMax Audio model works with voice and audio data in a broader sense, covering a wide range of audio-related tasks. Its capabilities include:

  • Speech Recognition and Transcription: Accurately converts spoken words into text.
  • Audio Enhancement: Improves sound quality by reducing noise and other imperfections.
  • Speech Synthesis: Generates lifelike voice outputs for various applications. The competitive advantages of the MiniMax Audio model here include the following:
    • Emotional Intelligence: Speech-01 excels in capturing and conveying intricate human emotions, tones, and even laughter. By analyzing text for emotional cues, it generates speech that feels authentically human.
    • Contextual Understanding: The model intuitively adapts its tone to match the emotional depth of the message, whether it's joy, excitement, or melancholy, ensuring a more immersive listening experience.
    • High-Quality Performance: The model preserves the essence of original voices, including unique rhythms, accents, and personality quirks, making it an invaluable tool for broadcasters, educators, and content creators.
    • Ultra-Long Text Synthesis: Unlike most models that are limited to 100,000 characters, MiniMax Audio supports up to 10 million characters in a single output. 
    • Customizable Voices: With the ability to replicate thousands of distinct voices, the MiniMax Audio model allows users to blend characteristics effortlessly, creating a rich palette of voice styles, emotions, and tones.
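
To put the ultra-long text figures in perspective, here is a minimal Python sketch of the workaround that a per-request character limit usually forces: splitting a long text into chunks at sentence boundaries. The 100,000 and 10 million figures come from the list above; the functions `split_text` and `requests_needed` are hypothetical helpers written for this illustration and are not part of any MiniMax tooling.

```python
import re


def split_text(text: str, max_chars: int) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    # Break after ., ! or ? followed by whitespace (a deliberately simple rule).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit has to be cut mid-sentence.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def requests_needed(total_chars: int, limit: int) -> int:
    """Number of requests needed to cover total_chars at a given limit."""
    return -(-total_chars // limit)  # ceiling division


if __name__ == "__main__":
    # Figures quoted in the article: 100,000 vs. 10 million characters.
    print(requests_needed(10_000_000, 100_000))
    print(split_text("One. Two. Three.", 10))
```

If the 10 million character figure holds, a text of that size would need a single request, whereas a model capped at 100,000 characters would need it split into 100 pieces and the audio stitched back together afterwards.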

According to the developers, all of this happens incredibly quickly. For example, the voice cloning process reportedly takes just 5 seconds.

The developer also attributes the quality of generation to the way the model was trained: “While traditional Text-to-Speech AI relies on fixed pronunciation dictionaries and predefined parameters, Speech-01 is trained on millions of hours of high-quality audio data. This allows it to grasp subtle nuances like accents, speech habits, and pitch variations, resulting in more natural and contextually aware speech.”

This model is built to serve as a flexible tool for businesses and developers who require advanced solutions for voice and audio processing. Whether it's for enhancing customer experiences, automating workflows, or creating new auditory content, the MiniMax Audio model offers a robust foundation to build upon.

Language Support

For text-to-speech generation on the official website, a wide range of languages is available: English, Chinese, Japanese, Korean, Spanish, Portuguese, German, French, Indonesian, Russian, and Italian. English and Chinese are offered with multiple dialects and accents for enhanced flexibility.

However, when accessed via API, the declared language support is currently limited to English, Chinese, Japanese, and Korean.
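
The gap between the two language lists matters for anyone planning an integration, so here is a small Python sketch that encodes both lists as stated above and reports where a given language is available. The names `WEB_LANGUAGES`, `API_LANGUAGES`, and `available_via` are our own, chosen for illustration; they do not correspond to anything in the MiniMax API, and the lists reflect the declared support at the time of writing.

```python
# Languages listed for text-to-speech on the official website (per this article).
WEB_LANGUAGES = frozenset({
    "English", "Chinese", "Japanese", "Korean", "Spanish", "Portuguese",
    "German", "French", "Indonesian", "Russian", "Italian",
})

# Languages declared for API access (per this article).
API_LANGUAGES = frozenset({"English", "Chinese", "Japanese", "Korean"})


def available_via(language: str) -> list[str]:
    """Return the channels ("web", "api") through which a language is offered."""
    name = language.strip().title()
    channels = []
    if name in WEB_LANGUAGES:
        channels.append("web")
    if name in API_LANGUAGES:
        channels.append("api")
    return channels


if __name__ == "__main__":
    for lang in ("Korean", "Russian", "Hindi"):
        print(lang, available_via(lang))
```

A check like this is worth running before promising, say, Russian or Italian voice-over in an API-based product, since those languages are currently listed for the website only.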

Competitors Overview

MiniMax Audio will have to compete with many previously launched models. The TTS market is crowded, with numerous models developed by IT giants such as Google, Microsoft, and Amazon:

Source: artificialanalysis.ai

Unfortunately, there is currently no publicly available information about MiniMax Audio being tested in well-known benchmarks for speech synthesis quality evaluation. This might be due to MiniMax Audio being a relatively new or specialized solution that has not yet undergone independent assessments.

Availability

At present, the model's capabilities can be tested in a sandbox environment available on the official website: 

Source: hailuo.ai

The user can input their text, select one of several dozen available voices (with filters for the desired language, gender, and accent), and listen to their text in the chosen voice.

For more detailed adjustments, the panel on the right provides controls ranging from basic options like Pitch and Speed to advanced sliders along scales such as “Deepen – Lighten,” “Stronger – Softer,” and “Nasal – Crisp”:

Source: hailuo.ai

Additionally, there are options to add echo and specific effects, such as simulating a LoFi phone or a robotic voice.
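
For readers who think in code, the sandbox controls described above can be pictured as a single settings object. The Python sketch below is purely illustrative: the field names, the normalized range of -1.0 to 1.0 with 0.0 as neutral, and the effect identifiers are all our assumptions, since the article only names the controls and the actual parameter names and ranges have not been covered here.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical identifiers for the effects mentioned in the article.
ASSUMED_EFFECTS = {"lofi_phone", "robotic"}

# Hypothetical field names for the sliders in the sandbox panel.
SLIDER_FIELDS = ("pitch", "speed", "deepen_lighten", "stronger_softer", "nasal_crisp")


@dataclass(frozen=True)
class VoiceSettings:
    """Illustrative model of the sandbox panel; ranges are assumed, not documented."""
    pitch: float = 0.0
    speed: float = 0.0
    deepen_lighten: float = 0.0   # negative = deeper, positive = lighter (assumed)
    stronger_softer: float = 0.0  # negative = stronger, positive = softer (assumed)
    nasal_crisp: float = 0.0      # negative = nasal, positive = crisp (assumed)
    echo: bool = False
    effect: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject slider values outside the assumed normalized range.
        for name in SLIDER_FIELDS:
            value = getattr(self, name)
            if not -1.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between -1.0 and 1.0, got {value}")
        if self.effect is not None and self.effect not in ASSUMED_EFFECTS:
            raise ValueError(f"unknown effect: {self.effect}")


if __name__ == "__main__":
    settings = VoiceSettings(pitch=0.2, deepen_lighten=-0.5, effect="robotic")
    print(settings)
```

Grouping the controls this way makes it easy to store a preset per character or per project and to validate it before sending anything to a synthesis service.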

Most of the model's functions are already accessible through the API. In the very near future, as always, a detailed description of all available API parameters will appear on our website.

Conclusions

Well, this month saw the arrival of a new TTS model on the market. The developer claims it has been trained on higher-quality data than most competitor models and promises impressive performance. However, until it appears on TTS benchmark websites with real comparisons, these claims are hard to verify.

Online users have praised the lively and interesting voices available on the list, but they have also noted that assigning different emotional tones to the same voice makes a barely noticeable difference.

We’ll be watching closely to see what niche MiniMax Audio can carve out in the already crowded global TTS market.
