
Wan 2.2 Plus Text to Video

Wan2.2 T2V excels in tasks like visual question answering, cross-modal retrieval, and complex data analysis involving images and language. Optimized for scalable API use, it supports streaming and function calling for efficient automation of multi-modal workflows.
Try it now

AI Playground

Test any API model in the sandbox environment before you integrate. More than 200 models are available to build into your app.

Wan 2.2 Plus Text to Video

Wan2.2 T2V balances powerful multi-modal AI performance against real-world limits in image-text understanding and processing.

Alibaba's Wan2.2 is a state-of-the-art AI model designed for multi-modal understanding, particularly the integration of text and vision inputs. It supports large-context processing with high precision in text-to-vision tasks and complex reasoning.

Technical Specification

Performance Benchmarks

  • VQA-bench: 78.3%
  • Multi-modal Reasoning: 52.7%
  • Cross-modal Retrieval: 81.9%

Performance Metrics

Wan2.1 leads with an overall VBench score of 86.22%, excelling in dynamic motion, spatial relationships, color accuracy, and multi-object interaction. Training foundational video models demands vast compute power and large, high-quality datasets. Open access to these models reduces barriers, empowering more businesses to create tailored, high-quality visual content in a cost-effective way.

Key Capabilities

  • Vision-Language Fusion: Excels in interpreting and generating responses combining image and text data.
  • Advanced Reasoning: Strong in multi-step reasoning across modalities for analytics and complex understanding.

API Pricing

  • 480P: $0.105/video
  • 1080P: $0.525/video
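Projecting spend from the per-video prices above is simple multiplication. A minimal sketch; the helper name and price table are illustrative, not part of the API:

```python
# Per-video prices from the pricing list above (USD).
PRICES = {"480P": 0.105, "1080P": 0.525}

def estimate_cost(resolution: str, num_videos: int) -> float:
    """Return the total cost in USD for a batch of generated videos."""
    return round(PRICES[resolution] * num_videos, 3)

print(estimate_cost("480P", 100))   # → 10.5
print(estimate_cost("1080P", 100))  # → 52.5
```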

Optimal Use Cases

  • Multi-modal Analysis: Combining image and text data for enhanced comprehension.
  • Visual Question Answering (VQA): Accurate and context-aware answers based on image-text inputs.
  • Cross-modal Retrieval: Efficient matching and retrieval across vision and language domains.
  • Business Intelligence: Complex data interpretation integrating visual content with textual analytics.

Code Sample
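A minimal request sketch in Python using only the standard library. The endpoint URL, model identifier, and payload field names below are assumptions based on typical text-to-video APIs, not the confirmed AI/ML API schema; verify them against the official documentation before use.

```python
import json
import os
import urllib.request

# Assumed endpoint and model id -- verify against the AI/ML API docs.
API_URL = "https://api.aimlapi.com/v2/generate/video"
MODEL = "wan/v2.2/t2v-plus"

def build_payload(prompt: str, resolution: str = "480P") -> dict:
    """Assemble the request body; field names are illustrative."""
    return {"model": MODEL, "prompt": prompt, "resolution": resolution}

def generate_video(prompt: str, resolution: str = "480P") -> dict:
    """POST a generation request and return the parsed JSON response."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(prompt, resolution)).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['AIML_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)

if __name__ == "__main__":
    print(generate_video("A red fox running through fresh snow, cinematic lighting"))
```

The API key is read from an environment variable rather than hard-coded, which keeps credentials out of source control.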

Comparison with Other Models

Vs. Gemini 2.5 Flash: Higher multi-modal accuracy (78.3% vs. 70.8% VQA-bench), better for integrated tasks.

Vs. OpenAI GPT-4 Vision: Larger context window (65K vs. 32K text tokens) supports longer conversations with images.

Vs. Qwen3-235B-A22B: Superior cross-modal retrieval precision (81.9% vs. ~78% estimated), optimized for large-scale vision-language workflows.

Limitations

Generated videos may occasionally contain unwanted elements such as text artifacts or watermarks; negative prompts can mitigate, but not fully eliminate, these occurrences.
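Negative prompts are typically passed as an extra field in the request body. The `negative_prompt` key below is an assumption about the payload schema, shown only to illustrate the mitigation described above:

```python
# Hypothetical request payload illustrating a negative prompt used to
# suppress text artifacts and watermarks; field names are assumptions.
payload = {
    "model": "wan/v2.2/t2v-plus",
    "prompt": "A timelapse of clouds drifting over a mountain range",
    "negative_prompt": "text, watermark, logo, subtitles",
    "resolution": "480P",
}
print(payload["negative_prompt"])  # → text, watermark, logo, subtitles
```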

API Integration

Accessible via the AI/ML API; see the official documentation for integration details.




The Best Growth Choice for Enterprise

Get API Key