

From instant voice cloning to multilingual emotional range, TTS-1.5-Max is the previous‑generation flagship for creators who need broadcast‑grade speech without the studio.
Built as part of Inworld’s next-generation voice AI stack, it delivers human-like speech synthesis with real-time responsiveness and advanced voice customization. Whether you're building conversational agents, immersive games, or scalable voice interfaces, TTS-1.5-Max offers a rare combination of quality, speed, and cost efficiency.
It is designed to serve as the default choice for most production use cases, balancing premium audio quality with real-time performance.
Feed it just 5–10 seconds of someone speaking, and TTS-1.5-Max captures the essence — timbre, pace, and delivery style. No fine‑tuning, no retraining. It's ideal for maintaining a consistent narrator across a 10‑hour audiobook or keeping an NPC’s voice stable across a massive game script.
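Because cloning needs no fine-tuning, a request is just text plus the short reference clip. Here is a minimal sketch of how such a request payload might be assembled; the endpoint shape and field names (`model`, `reference_audio`) are illustrative assumptions, not the documented API.

```python
import base64
import json

def build_clone_request(reference_wav: bytes, text: str) -> dict:
    """Assemble a hypothetical voice-cloning request: the short reference
    clip rides along with the text, with no separate training step."""
    return {
        "model": "tts-1.5-max",
        "text": text,
        # 5-10 seconds of reference audio, base64-encoded for JSON transport
        "reference_audio": base64.b64encode(reference_wav).decode("ascii"),
    }

request = build_clone_request(b"\x00" * 16, "Chapter one. It began at dusk.")
payload = json.dumps(request)  # ready to POST to the synthesis endpoint
```

The same `reference_audio` can be reused across every chapter or scene, which is what keeps a narrator consistent over hours of output.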
Beyond basic SSML, you can guide emotions, insert natural breaths, laughter, sighs, or even subtle vocal fry. The model interprets these cues and renders them convincingly. Perfect for scenes that need a genuine reaction, not just flat dialogue.
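In practice these cues are inline tags in the script. The bracket syntax and tag names below are assumptions for illustration (the real markup vocabulary may differ); the sketch shows how a pipeline could scan a line for its non-verbal cues before synthesis.

```python
import re

# Illustrative inline cues; the actual tag names may differ in the real API.
line = "I can't believe it... [sigh] Fine. [laugh] You win this time."

# Pull out the non-verbal cues so a pipeline can validate them pre-synthesis
cues = re.findall(r"\[([a-z ]+)\]", line)
```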
English, Spanish, French, German, Italian, Portuguese, Polish, Hindi, Japanese, Korean, and Mandarin — all from the same API. Accents and code‑switching? Handled with surprising grace. Great for localization without juggling separate voice engines.
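Since every language goes through the same API, localization can be a loop over one request template rather than a switch between engines. A minimal sketch, with the `language` field name assumed for illustration:

```python
# One template, many locales: no per-language voice engine to manage
lines = [
    ("en", "Welcome back, traveler."),
    ("es", "Bienvenido de nuevo, viajero."),
    ("ja", "おかえりなさい、旅人よ。"),
]

requests = [
    {"model": "tts-1.5-max", "language": code, "text": text}
    for code, text in lines
]
```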
Traditional TTS often swaps between pre‑recorded units. Inworld’s model uses in‑context learning: it treats the reference audio as a dynamic prompt that steers the autoregressive generation. It preserves the speaker’s unique cadence across different emotional states and languages, which is nearly impossible with fixed voice profiles.
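The idea can be shown with a toy autoregressive loop, not the real model: the reference clip's tokens sit at the front of the context window, so every subsequent prediction is conditioned on them, which is what "reference audio as prompt" means mechanically.

```python
def generate(reference_tokens, text_tokens, step, max_new=4):
    """Toy autoregressive loop: the reference clip is just a prefix in the
    context, so it steers every token predicted afterwards."""
    context = list(reference_tokens) + list(text_tokens)
    out = []
    for _ in range(max_new):
        nxt = step(context)   # next audio token, conditioned on full context
        out.append(nxt)
        context.append(nxt)   # fed back in: generation is sequential
    return out

# Stand-in "model": next token is the context sum modulo 97
audio = generate([3, 1, 4], [10, 20], lambda ctx: sum(ctx) % 97)
```

Swap the reference tokens and the entire continuation shifts, with no retraining, which is the practical payoff over fixed voice profiles.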
Most speech APIs operate at 16 kHz or 24 kHz; TTS-1.5-Max renders at studio‑grade 48 kHz. The autoregressive approach generates audio waveform tokens sequentially, balancing fine detail with acceptable latency. It’s the reason whispered dialogue sounds intimate and shouted lines don’t clip or distort.
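A back-of-envelope calculation shows why the token rate matters at 48 kHz. The frame size below is a hypothetical codec parameter, not a published figure; the point is that every second of high-fidelity audio costs a fixed budget of sequential tokens.

```python
SAMPLE_RATE = 48_000   # studio-grade output rate
FRAME_SIZE = 960       # hypothetical audio samples per waveform token

tokens_per_second = SAMPLE_RATE / FRAME_SIZE   # tokens the model must emit per second
clip_seconds = 12
total_tokens = int(tokens_per_second * clip_seconds)
```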
With 8.8B parameters — more than 5× the capacity of TTS‑1 (1.6B) — this model adds layers of richness and dynamic range. In practice, that means warmer voices, better consonant clarity, and more believable emotional arcs. The trade‑off? Inference is a bit heavier. It shines in pre‑rendered workflows (cutscenes, audiobooks, voiceovers) rather than ultra‑low‑latency real‑time chat.
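Some quick arithmetic on the published parameter counts makes the trade-off concrete; the fp16 assumption is ours, and actual serving footprints vary with quantization and activations.

```python
PARAMS_15_MAX = 8.8e9   # TTS-1.5-Max parameter count
PARAMS_TTS1 = 1.6e9     # TTS-1 parameter count
BYTES_FP16 = 2          # assumed weight precision

weights_gb = PARAMS_15_MAX * BYTES_FP16 / 1e9   # rough weight memory alone
ratio = PARAMS_15_MAX / PARAMS_TTS1             # capacity multiple over TTS-1
```

Weight memory on that order is why the model fits pre-rendered pipelines more naturally than latency-critical chat.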
You can prompt for chuckles, gasps, sighs, and even breathing patterns. Developers working on interactive narrative games use this to make characters feel present — a quiet sigh before delivering bad news, or a sharp inhale when surprised. The model interprets these cues naturally.
If you're producing an audiobook, documentary narration, or a branded explainer series, this model eliminates the "robotic sheen" common in TTS. The ability to clone a narrator’s voice once and then generate hours of consistent, expressive audio is a massive time saver. Post‑production teams use it to iterate on dialogue without costly re‑recording sessions.
Game studios lean on TTS-1.5-Max for dynamic NPC barks and monologues. Because the model understands markup and emotion, a character can deliver the same line in a threatening whisper or a cheerful shout. Combined with voice cloning, you can have an entire cast of distinct voices generated from a handful of reference samples — perfect for indie teams or sprawling open‑world projects.
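The "same line, different delivery" pattern reduces to varying an emotion tag while the text stays fixed. A minimal sketch, with hypothetical markup and field names:

```python
line = "Put the weapon down."

def styled(text: str, emotion: str) -> dict:
    # Hypothetical bracket markup; the real tag vocabulary may differ.
    return {"model": "tts-1.5-max", "text": f"[{emotion}] {text}"}

whisper = styled(line, "threatening whisper")
shout = styled(line, "cheerful shout")
```

Drive the emotion argument from game state and one written line covers every mood an NPC needs.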