

The Qwen3-Omni Captioner is a fine-tuned model that takes audio input and generates detailed, low-hallucination descriptive captions for audio clips.
Qwen3-Omni Captioner builds on Qwen3-Omni, a state-of-the-art, natively end-to-end, multilingual omni-modal foundation model developed by Alibaba Cloud’s Qwen team. Qwen3-Omni accepts text, images, audio, and video as input and delivers real-time streaming responses in both natural text and speech. It maintains high performance across modalities without degradation, making it a leading multimodal AI solution.
Qwen3-Omni achieves state-of-the-art results on 22 out of 36 audio and audio-visual benchmarks, outperforming strong closed-source models including Gemini 2.5 Pro and GPT-4o-Transcribe.

vs Gemini 2.5 Pro: Qwen3-Omni matches or exceeds Gemini’s performance on audio-video benchmarks and, as an open-source model, is more accessible. Its ASR performance is comparable, and its streaming speech generation has lower latency.
vs Seed-ASR: Qwen3-Omni achieves superior or comparable word error rates while providing broader multimodal capabilities beyond audio.
vs GPT-4o: Qwen3-Omni excels at multimodal audio and video tasks while remaining strong on text tasks. It offers lower-latency streaming audio output using a native multi-codebook speech codec.
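The Seed-ASR comparison above is stated in terms of word error rate (WER). For readers unfamiliar with the metric, the following is a minimal sketch of how WER is commonly computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a model transcript, divided by the number of reference words. The function name `word_error_rate` and the simple lowercase-and-split normalization are assumptions made for illustration; they are not taken from the Qwen3-Omni evaluation setup, which may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the word-level edit distance divided by the reference length."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A lower value is better, and the value can exceed 1.0 when the hypothesis contains many inserted words.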
Accessible via AI/ML API. Documentation: available here.