

Wan 2.7 is Alibaba Tongyi Lab's most capable video generation system to date. It brings four distinct creation modes under a single model family: text-to-video, image-to-video, reference-to-video, and natural-language video editing.
Each mode in the Wan 2.7 suite targets a specific production scenario. They share the same underlying diffusion transformer architecture but expose different input contracts and motion-handling strategies.
Most text-to-video models treat a prompt as a flat string. Wan 2.7's T2V endpoint feeds it through an internal reasoning pass — what the team calls "thinking mode" — before generation begins. The result is noticeably better layout on complex prompts: multi-character scenes hold spatial logic, camera directions land where you expect them, and lighting descriptions actually propagate across the full clip.
Prompt expansion is worth enabling when you're working from short or incomplete descriptions. The model internally elaborates on scene depth, focal length, and motion dynamics — then exposes the actual prompt used so you can inspect and iterate on it.
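If you want a feel for what that looks like in practice, here's a minimal request sketch. The endpoint URL, the enable_prompt_expansion flag, and the expanded_prompt response field are assumptions for illustration, not the documented contract; check the official API reference for the real names.

```python
import requests

# Hypothetical endpoint and field names -- adapt to the actual Wan 2.7 API contract.
API_URL = "https://api.example.com/wan2.7/text-to-video"
API_KEY = "YOUR_API_KEY"

payload = {
    "prompt": "Two chess players in a dim cafe; slow dolly-in on the board, "
              "warm tungsten key light from the left",
    "enable_prompt_expansion": True,  # let the model elaborate a short prompt
    "resolution": "1080p",
}

resp = requests.post(API_URL, json=payload,
                     headers={"Authorization": f"Bearer {API_KEY}"})
resp.raise_for_status()
job = resp.json()

# Inspect the expanded prompt the model actually used, then iterate on it.
print(job.get("expanded_prompt"))
print(job.get("video_url"))
```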
Where most image-to-video tools animate from a starting frame and let motion drift wherever physics and chance take it, Wan 2.7's I2V gives you explicit control over both endpoints. You supply the first frame and the last frame, and the model fills in the motion path between them. Subject identity stays consistent across the transition, which eliminates the ghosting and gradual drift that typically ruin longer clips.
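A hedged sketch of a first/last-frame request, again with a hypothetical endpoint and field names (first_frame, last_frame) standing in for whatever the real contract specifies:

```python
import base64
import requests

def b64(path: str) -> str:
    """Read an image file and return its base64 encoding."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

# Hypothetical parameter names for first/last-frame conditioning.
payload = {
    "prompt": "The mug rotates a quarter turn as steam rises",
    "first_frame": b64("mug_start.png"),  # where the clip begins
    "last_frame": b64("mug_end.png"),     # where the clip must land
    "duration_seconds": 5,
}

resp = requests.post("https://api.example.com/wan2.7/image-to-video",
                     json=payload,
                     headers={"Authorization": "Bearer YOUR_API_KEY"})
resp.raise_for_status()
print(resp.json().get("video_url"))
```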
When you need a product shown from multiple perspectives in the same sequence, the 9-grid input lets you feed in a contact sheet of reference angles. The model stitches these into a coherent multi-shot clip rather than treating each angle as a separate generation, keeping brand visuals consistent across every frame.
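You can assemble that contact sheet locally before uploading it. The sketch below uses Pillow; the 3x3 layout and 512-pixel cell size are assumptions about what the grid input expects, so adjust to the documented format.

```python
from PIL import Image

def make_nine_grid(paths, cell=512):
    """Tile nine reference angles into a single 3x3 contact sheet."""
    assert len(paths) == 9, "the 9-grid input expects exactly nine views"
    sheet = Image.new("RGB", (cell * 3, cell * 3))
    for i, path in enumerate(paths):
        img = Image.open(path).convert("RGB").resize((cell, cell))
        row, col = divmod(i, 3)
        sheet.paste(img, (col * cell, row * cell))
    return sheet

# Nine product angles -> one grid image you can pass as the reference.
grid = make_nine_grid([f"angle_{i}.jpg" for i in range(9)])
grid.save("product_nine_grid.png")
```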
R2V is arguably the most technically ambitious mode in the suite. It's built for teams that need the same person, character, or product to appear consistently across many clips — without a traditional fine-tuning or LoRA workflow. You pass references in; the model extracts identity embeddings and locks them into the generation process.
The five-reference ceiling is the highest in the industry right now. You can mix image and video references freely within that budget, which means you can supply a front-facing photo, a side profile, two motion clips showing how the character moves, and an audio clip capturing their voice; the output holds all of those attributes simultaneously.
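In request terms, that reference budget might look something like the following. The references array, its type values, and the endpoint path are illustrative assumptions rather than the published schema.

```python
import requests

# Hypothetical R2V request mixing reference types within the five-slot budget.
payload = {
    "prompt": "The character walks through a rainy market, greeting a vendor",
    "references": [
        {"type": "image", "url": "https://example.com/refs/front.jpg"},
        {"type": "image", "url": "https://example.com/refs/profile.jpg"},
        {"type": "video", "url": "https://example.com/refs/walk_cycle.mp4"},
        {"type": "video", "url": "https://example.com/refs/gesture.mp4"},
        {"type": "audio", "url": "https://example.com/refs/voice.wav"},
    ],
}

resp = requests.post("https://api.example.com/wan2.7/reference-to-video",
                     json=payload,
                     headers={"Authorization": "Bearer YOUR_API_KEY"})
resp.raise_for_status()
print(resp.json().get("video_url"))
```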
For AI avatar production, digital influencer pipelines, or any project where character consistency has historically meant expensive per-subject fine-tuning, R2V changes the economics significantly.
Wan 2.7 spans a wide range of commercial applications. The combination of high-resolution output, character-stable R2V, and natural-language editing removes dependencies that previously required dedicated production crews or per-project model fine-tuning.
Wan 2.7 is built on a Diffusion Transformer (DiT) foundation combined with Flow Matching, the same architectural direction that has driven consistent scaling gains in both image and video generation over the past two years. Cross-attention handles text conditioning, while full spatio-temporal attention models motion dynamics across space and time in a single pass.
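For intuition, the flow-matching objective behind that pairing fits in a few lines. This is a generic rectified-flow-style sketch, not Wan 2.7's actual training code, and the model(x_t, t, text_emb) signature is an assumption about how the DiT is conditioned.

```python
import torch

def flow_matching_loss(model, x1, text_emb):
    """One conditional flow-matching step: regress the velocity that
    carries noise x0 toward data x1 along a straight path."""
    x0 = torch.randn_like(x1)                      # noise sample
    t = torch.rand(x1.shape[0], device=x1.device)  # per-sample time in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))
    x_t = (1 - t_) * x0 + t_ * x1                  # point on the linear path
    v_target = x1 - x0                             # constant velocity of that path
    v_pred = model(x_t, t, text_emb)               # DiT predicts the velocity field
    return torch.mean((v_pred - v_target) ** 2)
```

Training against straight noise-to-data paths is what makes the objective simple to optimize at scale, which is a large part of why DiT plus Flow Matching has become the default recipe for recent video generators.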