
Qwen3.5 Omni Flash

A natively omnimodal AI that reads text, watches video, listens to audio, and analyzes images — all at once, in a single forward pass. Built for production workloads where latency and cost can't be afterthoughts.

Qwen3.5 Omni Flash scores well above the median for non-reasoning models in its price tier on the Artificial Analysis Intelligence Index (26 vs. a median of 15), generates output at 164 tokens per second, and supports every major input modality out of the box. It's competitively priced but runs verbose, producing roughly 2.5x more output tokens than average, which affects billing on output-heavy tasks.

What Qwen3.5 Omni Flash actually is

Most models that claim multimodal support are really just text models with extra adapters bolted on. Qwen3.5 Omni Flash is different. It processes text, speech, image, and video natively — all inside one unified architecture — and can generate both text and streaming speech as output. The Flash variant is the production-default recommendation from Alibaba: the middle path between maximum quality and minimum cost.

API Pricing

  • Input (text/image/video): $0.52
  • Input (audio): $3.90
  • Output (text): $2.86
  • Output (text+audio): $15.47
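As a rough worked example, the ~2.5x output verbosity noted above can be folded into a cost estimate. One assumption to flag: the prices are treated here as per 1M tokens, which is the usual convention for such listings — verify against the provider's pricing page before relying on the numbers.

```python
# Rough per-request cost estimator for a text-in / text-out call,
# using the listed prices. Assumption: prices are per 1M tokens.

PRICE_IN_TEXT = 0.52    # $ per 1M input tokens (text/image/video)
PRICE_OUT_TEXT = 2.86   # $ per 1M output tokens (text)

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in dollars for one request."""
    return (input_tokens * PRICE_IN_TEXT + output_tokens * PRICE_OUT_TEXT) / 1_000_000

# The model runs verbose (~2.5x typical output length), so for
# output-heavy tasks scale the expected output tokens accordingly.
typical_output_tokens = 800
cost = estimate_cost(input_tokens=4_000,
                     output_tokens=int(typical_output_tokens * 2.5))
```

Because output is billed at roughly 5.5x the text-input rate, the verbosity multiplier dominates the bill on generation-heavy workloads.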

Architecture: Thinker and Talker

Qwen3.5 Omni Flash builds on the Thinker–Talker architecture introduced in Qwen2.5-Omni, then adds five major technical upgrades for the 3.5 generation. Understanding the design matters if you're integrating this into latency-sensitive pipelines.

How the two components work

Thinker — the reasoning core

Handles all understanding tasks. Ingests text, image, audio (via the AuT encoder), and video frames simultaneously. Uses a Hybrid-Attention Mixture-of-Experts (MoE) design, meaning only a subset of parameters activates per token, which is how the model achieves fast inference without sacrificing model capacity.
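To make the sparse-activation idea concrete, here is a toy illustration of MoE routing — not the actual Qwen implementation, just the general mechanism: a gate scores every expert per token, but only the top-k experts run, so per-token compute stays small while total parameter count stays large.

```python
# Toy Mixture-of-Experts routing sketch (illustrative only).
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(token, experts, gate_scores, k=2):
    """Run only the k best-scoring experts and mix their outputs."""
    probs = softmax(gate_scores)
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over active experts
    return sum(probs[i] / norm * experts[i](token) for i in top)

# Four "experts" (here just scalar functions); only 2 activate per token.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x - 3, lambda x: x * x]
out = moe_forward(3.0, experts, gate_scores=[0.1, 2.0, 0.2, 1.5], k=2)
```

With k=2 of 4 experts active, the output is a weighted blend of only the two winning experts — the other two cost nothing at inference time.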

Talker — the speech synthesizer

Receives high-level representations directly from Thinker during decoding and autoregressively predicts multi-codebook speech tokens. A causal ConvNet then reconstructs the waveform frame-by-frame, enabling streaming speech output with very low latency. The Talker also supports a dedicated voice system prompt for zero-shot voice cloning.

ARIA — alignment for real-time interaction

A new technique introduced in 3.5 that dynamically aligns text and speech units during streaming decoding. This solves a long-standing naturalness problem in auto-regressive speech: the output sounds more fluid and less robotic, especially in multi-turn spoken dialogue.

Multi-codebook codec

Rather than decoding each residual codebook of a frame sequentially, which adds latency, a multi-token prediction (MTP) module predicts all residual codebooks for a frame in a single step, enabling immediate, frame-at-a-time synthesis. This is what keeps voice response latency in the sub-300ms range even on long utterances.
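A back-of-envelope sketch shows why per-frame multi-token prediction matters for time-to-first-audio. The per-step decode cost and codebook count below are illustrative assumptions, not published figures:

```python
# Why MTP lowers time-to-first-audio: without it, each residual codebook
# of the first frame needs its own sequential decode step; with it, all
# codebooks for a frame are predicted in one step.

DECODE_MS_PER_STEP = 15.0  # assumed model time per autoregressive step

def time_to_first_audio(codebooks: int, mtp: bool) -> float:
    """Milliseconds until the first audio frame can be played."""
    steps = 1 if mtp else codebooks
    return steps * DECODE_MS_PER_STEP

sequential = time_to_first_audio(codebooks=8, mtp=False)  # 8 sequential steps
with_mtp = time_to_first_audio(codebooks=8, mtp=True)     # 1 step
```

Whatever the real constants, the structural point holds: first-frame latency scales with codebook count in the sequential case and is constant with MTP.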

Long-context audio/visual handling

The 256K token context window extends to audio and video too. The model can process over 10 hours of audio content, or roughly 400 seconds of 720p video sampled at 1 FPS — in a single call. Audio features are downsampled to 12.5Hz via Conv2D blocks before the attention layers, keeping the token count manageable.

Supported modalities

Flash supports the full input-output modality set. Language coverage is the model's most striking feature: 113 languages and dialects for speech recognition, 36 for speech generation — a meaningful jump over prior generations.

  • Input modalities: Text, Image, Audio / Speech, Video
  • Output modalities: Text, Streaming speech
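A sketch of what a mixed-modality request body can look like. The field names follow the common OpenAI-style "content parts" convention; the exact schema for this model's API may differ, so treat this as illustrative rather than as the documented format:

```python
# Illustrative multimodal message builder (OpenAI-style content parts;
# not necessarily this model's exact schema).

def build_multimodal_message(text, image_url=None, audio_url=None, video_url=None):
    parts = [{"type": "text", "text": text}]
    if image_url:
        parts.append({"type": "image_url", "image_url": {"url": image_url}})
    if audio_url:
        parts.append({"type": "audio_url", "audio_url": {"url": audio_url}})
    if video_url:
        parts.append({"type": "video_url", "video_url": {"url": video_url}})
    return {"role": "user", "content": parts}

msg = build_multimodal_message(
    "Summarize what is said and shown.",
    audio_url="https://example.com/clip.wav",
    video_url="https://example.com/clip.mp4",
)
```

The point of a natively omnimodal model is that all of these parts land in one forward pass rather than being routed through separate adapters.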

Language coverage

  • Speech recognition: 113 languages and dialects. Includes English, Chinese, Korean, Japanese, German, Russian, French, Spanish, Arabic, Urdu, and dozens more.
  • Speech generation: Output in 36 languages. Zero-shot voice cloning from a 10–30 second audio sample. Real-time control over speed, volume, and emotional tone.
  • Text languages: Strong multilingual text capability. In testing, seamlessly handled prompts in Spanish, Portuguese, and English within the same conversation without losing context.

Key interaction features

  • Semantic interruption: Distinguishes between filler sounds (hmm, uh-huh) and genuine intent to interrupt. The model won't stop mid-sentence when someone coughs — it actually understands conversational intent.
  • Audio-Visual Vibe Coding: Can watch a screen recording or video of a coding task and write functional code from what it sees and hears — no text prompt required. An early preview of workflow-embedded AI.
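The model learns interruption intent end-to-end; the toy heuristic below only illustrates the distinction it draws — backchannel sounds should not cut off the assistant, while substantive speech should:

```python
# Toy illustration of the filler-vs-intent distinction behind semantic
# interruption. The real model does this with learned audio understanding,
# not a word list.

FILLERS = {"hmm", "uh", "uh-huh", "mm", "mhm", "yeah"}

def should_interrupt(transcript: str) -> bool:
    """True if the user's utterance looks like a genuine interruption."""
    words = transcript.lower().strip(".,!? ").split()
    if not words:
        return False
    # Pure backchannel: every token is a filler sound.
    return not all(w in FILLERS for w in words)

should_interrupt("uh-huh")            # backchannel: keep talking
should_interrupt("wait, stop there")  # genuine interruption
```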

Performance benchmarks

Numbers from the Qwen3.5-Omni Technical Report (April 2026) and third-party evaluation via Artificial Analysis. The Plus variant carries the headline benchmark scores, but Flash is explicitly designed to retain the overwhelming majority of that quality at lower latency and cost.

Intelligence index breakdown (Flash, non-reasoning models)

  • AA Intelligence Index: 26 / 36
  • Output throughput: 164 t/s
  • Context window: 256K

Qwen3.5-Omni family vs. external models (Plus variant headline results)

  • Audio understanding — vs. Gemini 3.1 Pro: Wins
  • Audio translation — vs. Gemini 3.1 Pro: Wins
  • Multilingual voice — vs. ElevenLabs / GPT-Audio: Wins
  • Audio-visual comprehension — vs. Gemini 3.1 Pro: Matches

Who it's built for

The Flash variant's combination of modalities, speed, and pricing makes it relevant to a specific set of production problems, particularly anywhere that text-first models with bolted-on audio/video adapters have been falling short.

01 Voice agent pipelines

Streaming speech output with sub-300ms latency, carried over from the predecessor architecture; ARIA alignment in 3.5 makes multi-turn dialogue sound natural rather than mechanical.

02 Auto-captioning at scale

Native audio + video in one model, 256K context, and 113 recognition languages. Batch test Flash vs. Plus for quality — Flash is the economic starting point.

03 Real-time transcription

Interviews, meetings, and media monitoring across 113 language/dialect combinations. Semantic interruption handling makes it suitable for live conversations.

04 Video summarization

Feed raw video (up to ~400 seconds of 720p at 1 FPS) and get a cross-modal summary that fuses what's said, what's shown, and what's written on screen — in one pass.

05 Multilingual customer support

36 speech output languages and voice cloning via API. Send a 10–30s voice sample to clone the voice for consistent brand tone across markets.

06 Developer tooling & coding assistants

Audio-Visual Vibe Coding lets the model interpret screen recordings and write functional code from what it sees. A useful bridge for workflow-embedded tooling.
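The voice-cloning flow in use case 05 can be sketched as a request payload. Every parameter name here (voice_sample_url, clone_from, speed, emotion) is an illustrative placeholder, not the model's documented API — consult the provider's docs for the real fields:

```python
# Hypothetical voice-cloning synthesis request (field names are
# placeholders, not the documented API).

def build_tts_request(text, voice_sample_url, speed=1.0, emotion="neutral"):
    # Per the coverage notes above, the reference sample should be
    # 10–30 seconds of audio.
    return {
        "model": "qwen3.5-omni-flash",
        "input": {"text": text},
        "voice": {"clone_from": voice_sample_url},
        "controls": {"speed": speed, "emotion": emotion},
    }

req = build_tts_request(
    "Thanks for calling — how can I help?",
    voice_sample_url="https://example.com/brand-voice-15s.wav",
)
```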
