

Qwen3.5 Omni Flash scores well above the median for non-reasoning models in its price tier on the Artificial Analysis Intelligence Index (26 vs. a median of 15), generates output at 164 tokens per second, and supports every major input modality out of the box. It's competitively priced but runs verbose, producing roughly 2.5x more output tokens than average, which affects billing on output-heavy tasks.
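The verbosity figure is easy to translate into concrete numbers. A minimal sketch, using placeholder pricing (the `price_per_1m_output` value below is illustrative only, not the model's actual rate):

```python
# Back-of-envelope cost/latency impact of the ~2.5x output verbosity.
# The price below is a PLACEHOLDER, not Qwen3.5 Omni Flash's real rate.

def job_estimate(output_tokens_typical: int,
                 verbosity_multiplier: float = 2.5,
                 tokens_per_second: float = 164.0,
                 price_per_1m_output: float = 1.00) -> dict:
    """Estimate output cost and generation time for one request."""
    tokens = output_tokens_typical * verbosity_multiplier
    return {
        "output_tokens": tokens,
        "cost_usd": tokens / 1_000_000 * price_per_1m_output,
        "generation_seconds": tokens / tokens_per_second,
    }

est = job_estimate(output_tokens_typical=800)
print(est)  # a "typical" 800-token answer becomes ~2000 tokens billed
```

The same 2.5x multiplier applies to wall-clock generation time, which is why the 164 tokens/s throughput matters more here than it would for a terser model.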
Most models that claim multimodal support are really just text models with extra adapters bolted on. Qwen3.5 Omni Flash is different. It processes text, speech, image, and video natively — all inside one unified architecture — and can generate both text and streaming speech as output. The Flash variant is the production-default recommendation from Alibaba: the middle path between maximum quality and minimum cost.
Qwen3.5 Omni Flash builds on the Thinker–Talker architecture introduced in Qwen2.5-Omni, then adds five major technical upgrades for the 3.5 generation. Understanding the design matters if you're integrating this into latency-sensitive pipelines.
Thinker: Handles all understanding tasks. Ingests text, image, audio (via the AuT encoder), and video frames simultaneously. Uses a Hybrid-Attention Mixture-of-Experts (MoE) design, meaning only a subset of parameters activates per token, which is how the model achieves fast inference without sacrificing capacity.
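The sparse-activation idea behind the MoE design can be sketched in a few lines. Everything below (dimensions, expert count, top-k) is illustrative; Qwen's actual routing configuration is not public at this granularity:

```python
import numpy as np

# Toy top-k expert routing for one token in a Mixture-of-Experts layer.
# Shapes and k are illustrative, not Qwen3.5 Omni Flash's real config.

rng = np.random.default_rng(0)
d_model, n_experts, k = 64, 8, 2

router_w = rng.normal(size=(d_model, n_experts))
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route one token vector to its top-k experts and mix their outputs."""
    logits = x @ router_w
    top = np.argsort(logits)[-k:]        # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()             # softmax over the selected k only
    # Only k of n_experts matmuls actually run: this is the sparse-activation
    # saving that buys fast inference at large total parameter counts.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

y = moe_forward(rng.normal(size=d_model))
print(y.shape)  # (64,)
```

Capacity lives in all `n_experts` weight matrices, but per-token compute scales with `k`, which is the trade-off the prose above describes.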
Talker: Receives high-level representations directly from Thinker during decoding and autoregressively predicts multi-codebook speech tokens. A causal ConvNet then reconstructs the waveform frame by frame, enabling streaming speech output with very low latency. The Talker also supports a dedicated voice system prompt for zero-shot voice cloning.
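Why causality matters for streaming can be shown with a toy causal convolution: each output sample depends only on the current and past frames, so audio can be emitted the moment each speech-token frame arrives, with no lookahead buffer. The kernel and values here are made up:

```python
import numpy as np

# Toy causal convolution: output at time t uses only inputs at t, t-1, t-2.
# This is the structural property that lets a causal ConvNet vocoder emit
# waveform frames as they are decoded, instead of waiting for the utterance.

kernel = np.array([0.5, 0.3, 0.2])  # taps over [t, t-1, t-2]; values made up

def stream_decode(frames):
    """Yield one output value per incoming frame, using only past frames."""
    history = [0.0] * (len(kernel) - 1)
    for f in frames:
        window = [f] + history              # current frame plus past context
        yield float(np.dot(kernel, window))  # emit immediately, no lookahead
        history = [f] + history[:-1]

out = list(stream_decode([1.0, 0.0, 0.0, 0.0]))
print(out)  # impulse response: [0.5, 0.3, 0.2, 0.0]
```

A non-causal filter would need future frames inside `window`, forcing the decoder to buffer ahead and adding latency on every frame.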
ARIA: A new technique introduced in 3.5 that dynamically aligns text and speech units during streaming decoding. This addresses a long-standing naturalness problem in autoregressive speech: the output sounds more fluid and less robotic, especially in multi-turn spoken dialogue.
MTP: Rather than predicting each frame's codebooks one decoding step at a time, a multi-token prediction (MTP) module models the residual codebooks for each frame in a single step, so each frame is complete and ready to synthesize immediately. This is what keeps voice response latency in the sub-300ms range even on long utterances.
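A toy comparison of sequential vs. MTP codebook decoding, with dummy token values (the real Talker heads are learned networks, not arithmetic; only the step counts are the point here):

```python
# Hedged sketch: why multi-token prediction (MTP) over residual codebooks
# cuts per-frame latency. Token values are dummy placeholders.

N_CODEBOOKS = 4          # codebooks per audio frame (illustrative)

def decode_frame_sequential(state):
    """One autoregressive step per codebook: frame ready after N steps."""
    tokens, steps = [], 0
    for cb in range(N_CODEBOOKS):
        steps += 1                       # separate forward pass per codebook
        tokens.append((state + cb) % 1024)
    return tokens, steps

def decode_frame_mtp(state):
    """MTP heads emit all residual codebooks in one step: frame ready at once."""
    steps = 1                            # single forward pass, parallel heads
    tokens = [(state + cb) % 1024 for cb in range(N_CODEBOOKS)]
    return tokens, steps

seq_tokens, seq_steps = decode_frame_sequential(7)
mtp_tokens, mtp_steps = decode_frame_mtp(7)
print(seq_steps, mtp_steps)              # 4 vs 1 decoding steps per frame
```

The frame content is identical either way; what changes is that the vocoder can start on each frame after one decoding step instead of `N_CODEBOOKS` steps.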
The 256K-token context window extends to audio and video too. The model can process over 10 hours of audio, or roughly 400 seconds of 720p video sampled at 1 FPS, in a single call. Audio features are downsampled to a 12.5 Hz token rate via Conv2D blocks before the attention layers, keeping the token count manageable.
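If you need to budget a mixed-modality call, a rough helper like the one below can flag oversized inputs before you send them. Note that the video tokens-per-frame figure is an assumption for illustration; the model's real packing (post-downsampling merges, vision tokenization) may differ, so treat this as a sanity check, not an exact count:

```python
# Rough context-budget helper for mixed audio/video/text calls.
# audio_tokens_per_sec reflects the 12.5 Hz feature rate from the text above;
# video_tokens_per_frame is an ASSUMPTION, not a published figure.

CONTEXT_TOKENS = 256_000

def context_usage(audio_seconds: float = 0.0,
                  video_seconds: float = 0.0,
                  text_tokens: int = 0,
                  audio_tokens_per_sec: float = 12.5,
                  video_tokens_per_frame: int = 256,   # assumed for 720p @ 1 FPS
                  fps: float = 1.0):
    """Return (estimated tokens used, whether the call fits the window)."""
    used = (audio_seconds * audio_tokens_per_sec
            + video_seconds * fps * video_tokens_per_frame
            + text_tokens)
    return used, used <= CONTEXT_TOKENS

# 400 s of 720p video at 1 FPS plus a 2,000-token prompt:
used, ok = context_usage(video_seconds=400, text_tokens=2_000)
print(used, ok)  # 104400.0 True under these assumed rates
```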
Flash supports the full input-output modality set. Language coverage is the model's most striking feature: 113 languages and dialects for speech recognition, 36 for speech generation — a meaningful jump over prior generations.
These numbers come from the Qwen3.5-Omni Technical Report (April 2026) and third-party evaluation by Artificial Analysis. The Plus variant carries the headline benchmark scores, but Flash is explicitly designed to retain the overwhelming majority of that quality at lower latency and cost.
The Flash variant's combination of modalities, speed, and pricing makes it relevant to a specific set of production problems, particularly anywhere that text-first models with bolted-on audio/video adapters have been falling short.
01 Voice agent pipelines: Streaming speech output with sub-300ms latency, built on the predecessor's streaming Talker design; ARIA alignment in 3.5 makes multi-turn dialogue sound natural rather than mechanical.
02 Auto-captioning at scale: Native audio + video in one model, 256K context, and 113 recognition languages. Batch-test Flash vs. Plus for quality; Flash is the economic starting point.
03 Real-time transcription: Interviews, meetings, and media monitoring across 113 language/dialect combinations. Semantic interruption handling makes it suitable for live conversations.
04 Video summarization: Feed raw video (up to ~400 seconds of 720p at 1 FPS) and get a cross-modal summary that fuses what's said, what's shown, and what's written on screen, in one pass.
05 Multilingual customer support: 36 speech output languages and voice cloning via API. Send a 10–30s voice sample to clone the voice for consistent brand tone across markets.
06 Developer tooling & coding assistants: Audio-Visual Vibe Coding lets the model interpret screen recordings and write functional code from what it sees. A useful bridge for workflow-embedded tooling.