Name: Qwen3-Omni Captioner API
Brand: Alibaba Cloud

Qwen3-Omni Captioner

The Qwen3-Omni Captioner is a fine-tuned model, focused on audio input and generating detailed, low-hallucination descriptive captions from audio clips.

Overview

Qwen3-Omni Captioner is a state-of-the-art, natively end-to-end multilingual omni-modal foundation model developed by Alibaba Cloud’s Qwen team. It supports diverse input modalities including text, images, audio, and video, and delivers real-time streaming responses in both natural text and speech. Qwen3-Omni maintains high performance across modalities without any degradation, making it a leading multimodal AI solution.

Technical Specifications

Thinker-Talker Architecture: Qwen3-Omni features a Thinker-Talker design, where the Thinker generates text and the Talker converts the Thinker's high-level representations into streaming speech tokens. This separation enables specialized processing for text generation and real-time audio synthesis.
Ultra-Low-Latency Streaming: To achieve real-time streaming voice output, the Talker predicts multi-codebook sequences autoregressively. At each step, a Multi-Token Predictor (MTP) module outputs residual codebooks for the current audio frame. The Code2Wav renderer then incrementally synthesizes the waveform, producing audio frame-by-frame for seamless streaming.
AuT Audio Encoder: The model’s audio encoder, AuT, is trained on 20 million hours of audio data, providing strong and generalizable audio feature extraction.
MoE Architecture: Both Thinker and Talker subsystems use Mixture-of-Experts (MoE) models for high concurrency and fast inference, activating only a subset of parameters per token for efficiency.

Performance Benchmarks

Qwen3-Omni achieves state-of-the-art results on 22 out of 36 audio and audio-visual benchmarks, outperforming strong closed-source models including Gemini 2.5 Pro and GPT-4o-Transcribe.

Text Understanding: Competitive with top models on MMLU, GPQA, reasoning, and code tasks.
Audio Recognition (ASR): Word Error Rate (WER) on par or better than Seed-ASR and GPT-4o-Transcribe across numerous datasets.
Multimodal Reasoning: Strong performance on audio-visual question answering and video description benchmarks.
Speech Generation: High-quality multilingual speech synthesis with consistent speaker identity across 10 languages.
Streaming Latency: Low first-packet latency (~211 ms), enabling near-instant speech response.
Audio Captioning: Fine-tuned model excels in generating detailed, accurate captions for arbitrary audio.

Key Features

Architecture: MoE-based Thinker–Talker design utilizing Audio Transformer (AuT) pretraining and multi-codebook speech synthesis for low-latency and high-fidelity output.
Extensive Reasoning: Enhanced reasoning ability across modalities with the Thinking model variant.
Customization: Behavior customizable via system prompts to control tone and style of interaction.
Open-Source Audio Captioner: Fine-tuned Qwen3-Omni-30B-A3B-Captioner provides detailed, low-hallucination audio descriptions.
Real-Time Interaction: Supports natural turn-taking with immediate text or speech responses.

API Pricing

Input $4.953
Output $3.978

Use Cases

Multilingual chatbots with audio and visual understanding
Real-time streaming transcription and translation across languages
Detailed audio and video content analysis and captioning
Multimodal question answering and reasoning
Voice assistants with natural speech and multimodal comprehension
Interactive multimedia content generation and navigation

Code Sample

Comparison with Other Models

vs Gemini 2.5 Pro: Qwen3-Omni matches or exceeds Gemini’s performance on audio-video benchmarks and offers better open-source accessibility. Comparable ASR performance with lower latency in streaming speech generation.

vs Seed-ASR: Qwen3-Omni achieves superior or comparable word error rates while providing broader multimodal capabilities beyond audio.

vs GPT-4o: Qwen3-Omni excels in multimodal audio and video tasks while maintaining strong text tasks proficiency. Lower latency streaming audio output with native multi-codebook speech codec.

API Integration

Accessible via AI/ML API. Documentation: available here.

Example H2

Try it now