The Qwen3-Omni Captioner is a fine-tuned model, focused on audio input and generating detailed, low-hallucination descriptive captions from audio clips.
Overview
Qwen3-Omni Captioner is a state-of-the-art, natively end-to-end multilingual omni-modal foundation model developed by Alibaba Cloud’s Qwen team. It supports diverse input modalities including text, images, audio, and video, and delivers real-time streaming responses in both natural text and speech. Qwen3-Omni maintains high performance across modalities without any degradation, making it a leading multimodal AI solution.
Technical Specifications
Thinker-Talker Architecture: Qwen3-Omni features a Thinker-Talker design, where the Thinker generates text and the Talker converts the Thinker's high-level representations into streaming speech tokens. This separation enables specialized processing for text generation and real-time audio synthesis.
Ultra-Low-Latency Streaming: To achieve real-time streaming voice output, the Talker predicts multi-codebook sequences autoregressively. At each step, a Multi-Token Predictor (MTP) module outputs residual codebooks for the current audio frame. The Code2Wav renderer then incrementally synthesizes the waveform, producing audio frame-by-frame for seamless streaming.
AuT Audio Encoder: The model’s audio encoder, AuT, is trained on 20 million hours of audio data, providing strong and generalizable audio feature extraction.
MoE Architecture: Both Thinker and Talker subsystems use Mixture-of-Experts (MoE) models for high concurrency and fast inference, activating only a subset of parameters per token for efficiency.
Performance Benchmarks
Qwen3-Omni achieves state-of-the-art results on 22 out of 36 audio and audio-visual benchmarks, outperforming strong closed-source models including Gemini 2.5 Pro and GPT-4o-Transcribe.
Text Understanding: Competitive with top models on MMLU, GPQA, reasoning, and code tasks.
Audio Recognition (ASR): Word Error Rate (WER) on par or better than Seed-ASR and GPT-4o-Transcribe across numerous datasets.
Multimodal Reasoning: Strong performance on audio-visual question answering and video description benchmarks.
Speech Generation: High-quality multilingual speech synthesis with consistent speaker identity across 10 languages.
Real-Time Interaction: Supports natural turn-taking with immediate text or speech responses.
API Pricing
Input $4.0005
Output $3.213
Use Cases
Multilingual chatbots with audio and visual understanding
Real-time streaming transcription and translation across languages
Detailed audio and video content analysis and captioning
Multimodal question answering and reasoning
Voice assistants with natural speech and multimodal comprehension
Interactive multimedia content generation and navigation
Code Sample
Comparison with Other Models
vs Gemini 2.5 Pro: Qwen3-Omni matches or exceeds Gemini’s performance on audio-video benchmarks and offers better open-source accessibility. Comparable ASR performance with lower latency in streaming speech generation.
vs Seed-ASR: Qwen3-Omni achieves superior or comparable word error rates while providing broader multimodal capabilities beyond audio.
vs GPT-4o: Qwen3-Omni excels in multimodal audio and video tasks while maintaining strong text tasks proficiency. Lower latency streaming audio output with native multi-codebook speech codec.