MiniMax Music Cover: Create Full AI Song Covers in Seconds

Upload any song, write a style prompt, and let the model rebuild the entire track from scratch — same melody, completely different world. Here's everything you need to know.

What is MiniMax Music Cover and how does it work?

Most AI music tools start from a blank page: you write a prompt and something entirely new comes out. MiniMax Music Cover takes a fundamentally different approach. It starts from a song you already own and rebuilds it, stem by stem, in any genre or style you describe. The original melody stays. Everything else — the instruments, the vocal character, the arrangement, the mix — gets remade from scratch.

The key distinction from voice cloning or sampling tools is this: the model doesn't lift audio from the source recording. It analyzes the melodic skeleton, harmonic framework, and rhythmic signature of your input track, then uses that extracted structure as a compositional anchor to generate a brand-new performance.

How the pipeline works

The model runs in three stages: (1) Melodic extraction — the source audio is processed to isolate the core tune, chords, and rhythm. (2) Style conditioning — your prompt describes the target sound in 10–300 characters ("Jazz, upright bass, warm female vocal, slow swing tempo"). (3) Full regeneration — instrumentation, vocals, dynamics, and final mix are generated fresh in the target style. If you don't provide new lyrics, the model runs an ASR pass and auto-extracts them.
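The three stages can be pictured as a simple pipeline. The function names below are purely illustrative stand-ins for the model's internal steps, not actual MiniMax code:

```python
# Conceptual sketch of the three-stage cover pipeline described above.
# All names here are illustrative; the real internals are not public.

def extract_melody(audio_path):
    # Stage 1: isolate the melodic skeleton, chords, and rhythm
    return {"melody": "...", "chords": "...", "rhythm": "..."}

def condition_on_style(skeleton, style_prompt):
    # Stage 2: attach the target-style description to the extracted structure
    return {**skeleton, "style": style_prompt}

def regenerate(conditioned):
    # Stage 3: generate a fresh performance anchored to the skeleton
    return f"new track in style: {conditioned['style']}"

track = regenerate(
    condition_on_style(extract_melody("song.mp3"),
                       "Jazz, upright bass, warm female vocal")
)
print(track)  # → new track in style: Jazz, upright bass, warm female vocal
```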

Released as part of MiniMax Music 2.6, the Cover feature represents a meaningful leap from earlier generation models. Where Music 2.0 was about generating original tracks from lyrics and prompts, Music 2.6 adds controlled transformation — a much harder technical problem. The difference between generating something new and faithfully transforming something that already exists is roughly the gap between a painter and a conservator.

Inputs the model accepts

The music-cover API endpoint takes three inputs: a reference audio file (MP3, WAV, or M4A), a style prompt of 10–300 characters, and an optional lyrics override. If you leave lyrics blank, auto-extraction kicks in. The style prompt is where the creative work happens — the more specific, the more controlled the output.
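As a sketch, a request to the cover endpoint might be assembled like this. The field names (`reference_audio`, `prompt`, `lyrics`) and the `model` identifier are assumptions based on the description above, not the official API reference — check MiniMax's docs for the exact schema:

```python
import json

def build_cover_request(audio_path, style_prompt, lyrics=None):
    """Assemble the three inputs the cover endpoint accepts (field names assumed)."""
    if not 10 <= len(style_prompt) <= 300:
        raise ValueError(f"style prompt must be 10-300 chars, got {len(style_prompt)}")
    payload = {
        "model": "music-cover",            # assumed model identifier
        "reference_audio": audio_path,     # MP3, WAV, or M4A
        "prompt": style_prompt,
    }
    if lyrics is not None:
        payload["lyrics"] = lyrics         # omit to trigger ASR auto-extraction
    return payload

req = build_cover_request(
    "original_song.mp3",
    "Jazz, upright bass, warm female vocal, slow swing tempo",
)
print(json.dumps(req, indent=2))
```

Leaving `lyrics` out of the payload is what opts you into the auto-extraction path.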

What the model produces

The output is a complete song in your target style: full vocals with natural phrasing and breath control, layered instrumentation, stereo mix at 44.1 kHz CD-quality, downloadable as MP3 or WAV. Tracks run up to four minutes. Every output is royalty-free and cleared for commercial use — YouTube, ads, podcasts, games, sync licensing, the works.

Key features of MiniMax Music 2.6 (Cover edition)

Music 2.6 isn't just a versioning bump. It restructures how the model handles control, latency, and audio fidelity across every layer of the pipeline. Here's what actually changed and why it matters for developers and creators building on top of it.

Melody preservation core

The model extracts the melodic skeleton from the source and keeps it intact through the full regeneration. Drift (the tendency of AI outputs to wander away from the original tune) is eliminated by design.

100+ instrument palette

From Fender Rhodes to guzheng to Roland 808s — the model can reconstruct your reference track with an expanded range of real and synthetic instruments, requested through the style prompt.

First output in under 20 seconds

Generation latency dropped significantly in 2.6. The first playable output typically lands in under 20 seconds, enabling fast iteration across multiple style variations.

Human-like emotional vocals

The vocal synthesis captures breathing, vibrato, and micro-timing — the subtle physical cues that separate convincing AI vocals from obviously synthetic ones.

Enhanced sub-bass reproduction

Low-frequency handling was rebuilt for 2.6, ensuring the bass translates cleanly whether the listener is on studio monitors or phone speakers — a persistent weak point in earlier generation models.

Structural tag control

Use [Verse], [Chorus], [Bridge], and nine other structure tags in your lyrics to direct the compositional shape with precision: not just vibe, but actual song architecture.

Comparing Music 2.0 and Music 2.6 (Cover)

Music 2.0 was a text-to-song model — give it lyrics and a style prompt, and it creates an original track. Music 2.6 adds two things Music 2.0 couldn't do: the Cover transformation mode (source audio in, restyled song out) and the Lyrics Optimizer (auto-generates lyrics from the style prompt). For most cover-generation use cases, you want the music-cover model specifically.

Advanced options worth knowing

  • Custom lyrics: Pass new lyrics in the lyrics field using structural tags like [Verse], [Chorus], [Bridge] to define the song's shape. This is how you repurpose a familiar melody with your brand's messaging.
  • Streaming output: Set "stream": true and "output_format": "hex" to receive audio data progressively — useful for apps where you want to start playback before generation is fully complete.
  • Instrumental mode: Available in music-2.6 (not the cover endpoint directly), this removes vocals entirely — useful for creating background tracks from a melodic reference.
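For the streaming option, hex-encoded audio chunks need to be decoded back into raw bytes on the client side. A minimal sketch, assuming each chunk arrives as a plain hex string (the real SSE/JSON framing around the chunks may differ):

```python
def decode_hex_chunks(chunks):
    """Convert a sequence of hex-encoded audio chunks into raw audio bytes."""
    audio = bytearray()
    for chunk in chunks:
        audio.extend(bytes.fromhex(chunk))
    return bytes(audio)

# Simulated stream: two hex chunks standing in for partial MP3 data
# ("494433" is the "ID3" tag that opens many MP3 files).
fake_stream = ["494433", "04000000"]
audio_bytes = decode_hex_chunks(fake_stream)
print(len(audio_bytes))  # → 7
```

In a real app you would append each decoded chunk to a playback buffer as it arrives, rather than waiting for the full track.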

Real-world use cases

The melody-preserving style transfer capability of MiniMax Music Cover opens workflows that simply weren't possible with earlier generation tools. Here's how different teams are putting it to use.

Content

YouTube & TikTok creators

Transform recognizable melodies into original-sounding tracks that drive recognition without copyright flags. Same hook, different sound.

Marketing

Brand jingle remixes

Take a familiar melody and rewrite it with brand-specific lyrics, then restyle it to match campaign tone from upbeat retail to cinematic luxury.

Music

Indie producers

Prototype cover versions across genres in minutes. Hear what your original song sounds like as a bossa nova, as a synth-pop track, as an acoustic folk piece.

Games / Film

Adaptive soundtracks

Take a single theme and generate style variations — tension, triumph, melancholy — all melodically consistent, ideal for scene scoring and dynamic game audio.

Education

Music theory tools

Show students how genre context transforms a piece. One melody, six genre interpretations — generated on demand, no studio time required.

Dev

Agent-based audio pipelines

Music 2.6 ships a native Music Agent Skill, enabling AI agents to trigger cover generation within broader automated content creation workflows.

MiniMax Music Cover vs. the competition (2026)

The AI music landscape in 2026 has a handful of credible players, but most focus on original generation rather than source-to-cover transformation. Here's how MiniMax stacks up across the dimensions that matter most for developers.

Model                 Cover mode   Max length   Vocal realism   Commercial use
MiniMax Music Cover   Native       4 min        Excellent       ✓ Royalty-free
Suno v4               Limited      ~4 min       Very good       Plan-dependent
Udio v2               –            ~3 min       Very good       Plan-dependent
Stable Audio 2.0      –            3 min        Moderate        ✓ Yes
Soundverse            Partial      Varies       Good            ✓ Yes

The important nuance here: Suno and Udio remain strong for original song generation, particularly for pop and mainstream genres where they have extensive training data. Where MiniMax Music Cover wins decisively is in the transformation use case: taking a melody that already exists and faithfully re-rendering it in a new style.

Limitations & pro tips

No model is unlimited. Here's an honest accounting of where MiniMax Music Cover has constraints, and practical ways to work around each one.

Four-minute track ceiling

Single generations are capped at roughly four minutes. For longer projects — a full five-minute song, an extended podcast intro, a film cue — generate in two parts and stitch them at an edit point. Pick the edit where the model's melodic trajectory naturally resolves (end of a chorus, end of a bridge) for the cleanest splice.
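The stitch itself can be done with nothing but the standard library, provided both parts share the same format (true if both came from the same 44.1 kHz endpoint). A minimal sketch using Python's `wave` module, with silent placeholder files standing in for the two generated parts:

```python
import wave

def write_placeholder(path, seconds, framerate=44100):
    """Helper: write a silent stereo 16-bit WAV standing in for a generated part."""
    with wave.open(path, "wb") as w:
        w.setnchannels(2)
        w.setsampwidth(2)
        w.setframerate(framerate)
        w.writeframes(b"\x00\x00\x00\x00" * framerate * seconds)

def stitch(part_a, part_b, out_path):
    """Concatenate two WAV files with identical parameters into one."""
    with wave.open(part_a, "rb") as a, wave.open(part_b, "rb") as b:
        params = a.getparams()
        frames = a.readframes(a.getnframes()) + b.readframes(b.getnframes())
    with wave.open(out_path, "wb") as out:
        out.setparams(params)
        out.writeframes(frames)

write_placeholder("part1.wav", 1)
write_placeholder("part2.wav", 2)
stitch("part1.wav", "part2.wav", "full_track.wav")

with wave.open("full_track.wav", "rb") as f:
    duration = f.getnframes() / f.getframerate()
print(duration)  # → 3.0
```

For a crossfade rather than a hard cut, you would overlap and mix the last/first few hundred milliseconds of samples instead of concatenating, which is easier in an audio library than raw `wave`.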

Style prompt character limit

The prompt window is 10–300 characters. It sounds tight, but it's plenty if you're precise. Focus on: genre name, one or two key instruments, vocal gender and character, and one mood word. Anything beyond that adds noise rather than control. "Blues, slide guitar, gravel-voiced male, slow and late-night" outperforms a 280-character paragraph.
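That recipe (genre, one or two instruments, vocal character, one mood word) is easy to enforce programmatically. A small illustrative helper — the function and its argument names are this article's convention, not anything from the API:

```python
def build_style_prompt(genre, instruments, vocal, mood):
    """Compose a style prompt from the four high-signal ingredients
    and verify it fits the 10-300 character window."""
    prompt = ", ".join([genre, *instruments, vocal, mood])
    if not 10 <= len(prompt) <= 300:
        raise ValueError(f"prompt is {len(prompt)} chars, must be 10-300")
    return prompt

p = build_style_prompt(
    "Blues", ["slide guitar"], "gravel-voiced male", "slow and late-night"
)
print(p)       # Blues, slide guitar, gravel-voiced male, slow and late-night
print(len(p))  # → 60
```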

Niche and regional genre coverage

The model excels at globally popular styles and contemporary genres. Highly regional styles or avant-garde subgenres with limited training representation may output something more generic. If you're targeting a niche style, anchor the prompt with better-known adjacent references: "similar to early 2000s UK garage" grounds the model better than a genre name it may not have seen frequently.

Lyrics auto-extraction quality varies

The ASR-based auto-extraction works well for clear vocal recordings in English. Tracks with heavy reverb, multi-part harmonics, or non-English lyrics may extract imperfectly. For precise lyric control, always pass your own lyrics in the request rather than relying on auto-extraction.

Prompt engineering cheatsheet

Iterate fast: run three or four style prompt variations in parallel and compare results. Small changes — "bright female vocal" vs "husky female vocal", "upright bass" vs "electric bass" — produce meaningfully different outputs. The model is more responsive to specific instrumental vocabulary than to mood adjectives alone.
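Running variations in parallel is straightforward with the standard library. Here `generate_cover` is a stand-in for the real API call (an assumed name, not an SDK function):

```python
from concurrent.futures import ThreadPoolExecutor

def generate_cover(style_prompt):
    """Placeholder for the network call; returns a label for demonstration."""
    return f"track for: {style_prompt}"

# Small, targeted variations: one ingredient changes per prompt.
variations = [
    "Jazz, bright female vocal, upright bass, laid-back",
    "Jazz, husky female vocal, upright bass, laid-back",
    "Jazz, bright female vocal, electric bass, laid-back",
]

# Threads suit this I/O-bound workload; each request waits on the network.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(generate_cover, variations))

for r in results:
    print(r)
```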

Frequently asked questions

What is the difference between MiniMax Music 2.0 and Music 2.6?

Music 2.0 is a text-to-song model — it generates original music from lyrics and a style prompt. Music 2.6 adds two capabilities that 2.0 lacked: the Cover mode (transform an existing song into a new style while preserving its melody) and the Lyrics Optimizer (auto-generate lyrics from your style prompt). Music 2.6 also delivers significantly faster generation and enhanced low-frequency audio quality.

Can I use MiniMax Music Cover output commercially?

Yes. Every output generated by the music-cover model is a brand-new AI composition; it doesn't sample or reproduce the original recording. All outputs are royalty-free and cleared for commercial use including YouTube, TikTok, ads, podcasts, games, sync licensing, and broadcast. Download in MP3 or WAV format.

How long does generation take?

Music 2.6 was optimized specifically for latency. The first output typically arrives in under 20 seconds, with a full 4-minute track completing in roughly 60–90 seconds depending on server load. Using the music-cover paid endpoint (versus the free tier) gives you priority queue access and consistently faster turnaround.

Do I need to provide lyrics, or does the model handle them?

Lyrics are optional. If you leave the lyrics field blank, the model runs an ASR (automatic speech recognition) pass on the source audio and extracts the lyrics automatically. If you want to change the lyrics — for a brand campaign, translated version, or entirely new narrative — pass your own lyrics in the request. Structural tags like [Verse], [Chorus], and [Bridge] let you direct the arrangement shape.
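A sketch of what custom lyrics with structural tags might look like in a request. The tag names come from the article; the `lyrics` field name is an assumption:

```python
# Custom lyrics with structural tags directing the arrangement shape.
lyrics = "\n".join([
    "[Verse]",
    "New words over the melody you already know",
    "[Chorus]",
    "Same hook, different sound",
    "[Bridge]",
    "A turn the original never took",
])

# Hypothetical payload fragment pairing the lyrics with a style prompt.
payload = {
    "prompt": "Cinematic pop, soaring female vocal, hopeful",
    "lyrics": lyrics,
}
print(payload["lyrics"].count("["))  # → 3 structure tags
```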

What genres does MiniMax Music Cover handle best?

The model covers the full range of mainstream and contemporary genres well: Pop, Jazz, Blues, Rock, Hip Hop, Electronic, Bossa Nova, Folk, Cinematic orchestral, Lo-fi, and more. It has particularly strong performance for globally distributed styles with high training data coverage. Very niche regional genres or avant-garde subgenres may produce more generic results.

Ready to get started? Get Your API Key Now!