

Wan 2.7 is Alibaba Tongyi Lab's most capable video generation system to date. It brings four distinct creation modes under a single model family: text-to-video, image-to-video, reference-to-video, and natural-language video editing.
Each mode in the Wan 2.7 suite targets a specific production scenario. They share the same underlying diffusion transformer architecture but expose different input contracts and motion-handling strategies.
Most text-to-video models treat a prompt as a flat string. Wan 2.7's T2V endpoint feeds it through an internal reasoning pass — what the team calls "thinking mode" — before generation begins. The result is noticeably better layout on complex prompts: multi-character scenes hold spatial logic, camera directions land where you expect them, and lighting descriptions actually propagate across the full clip.
Prompt expansion is worth enabling when you're working from short or incomplete descriptions. The model internally elaborates on scene depth, focal length, and motion dynamics — then exposes the actual prompt used so you can inspect and iterate on it.
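If you want a feel for what that looks like in practice, here's a minimal request sketch. The endpoint URL, the enable_prompt_expansion flag, and the expanded_prompt response field are assumptions for illustration, not the documented contract; check the official API reference for the real names.

```python
import requests

# Hypothetical endpoint and field names -- adapt to the actual Wan 2.7 API contract.
API_URL = "https://api.example.com/wan2.7/text-to-video"
API_KEY = "YOUR_API_KEY"

payload = {
    "prompt": "Two chess players in a dim cafe; slow dolly-in on the board, "
              "warm tungsten key light from the left",
    "enable_prompt_expansion": True,  # let the model elaborate a short prompt
    "resolution": "1080p",
}

resp = requests.post(API_URL, json=payload,
                     headers={"Authorization": f"Bearer {API_KEY}"})
resp.raise_for_status()
job = resp.json()

# Inspect the expanded prompt the model actually used, then iterate on it.
print(job.get("expanded_prompt"))
print(job.get("video_url"))
```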
Where most image-to-video tools animate from a starting frame and let motion drift wherever physics and chance take it, Wan 2.7's I2V gives you explicit control over both endpoints. You supply the first frame and the last frame, and the model fills in the motion path between them. Subject identity stays consistent across the transition, which eliminates the ghosting and gradual drift that typically ruin longer clips.
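A hedged sketch of a first/last-frame request, again with a hypothetical endpoint and field names (first_frame, last_frame) standing in for whatever the real contract specifies:

```python
import base64
import requests

def b64(path: str) -> str:
    """Read an image file and return its base64 encoding."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

# Hypothetical parameter names for first/last-frame conditioning.
payload = {
    "prompt": "The mug rotates a quarter turn as steam rises",
    "first_frame": b64("mug_start.png"),  # where the clip begins
    "last_frame": b64("mug_end.png"),     # where the clip must land
    "duration_seconds": 5,
}

resp = requests.post("https://api.example.com/wan2.7/image-to-video",
                     json=payload,
                     headers={"Authorization": "Bearer YOUR_API_KEY"})
resp.raise_for_status()
print(resp.json().get("video_url"))
```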
When you need a product shown from multiple perspectives in the same sequence, the 9-grid input lets you feed in a contact sheet of reference angles. The model stitches these into a coherent multi-shot clip rather than treating each angle as a separate generation, keeping brand visuals consistent across every frame.
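You can assemble that contact sheet locally before uploading it. The sketch below uses Pillow; the 3x3 layout and 512-pixel cell size are assumptions about what the grid input expects, so adjust to the documented format.

```python
from PIL import Image

def make_nine_grid(paths, cell=512):
    """Tile nine reference angles into a single 3x3 contact sheet."""
    assert len(paths) == 9, "the 9-grid input expects exactly nine views"
    sheet = Image.new("RGB", (cell * 3, cell * 3))
    for i, path in enumerate(paths):
        img = Image.open(path).convert("RGB").resize((cell, cell))
        row, col = divmod(i, 3)
        sheet.paste(img, (col * cell, row * cell))
    return sheet

# Nine product angles -> one grid image you can pass as the reference.
grid = make_nine_grid([f"angle_{i}.jpg" for i in range(9)])
grid.save("product_nine_grid.png")
```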
R2V is arguably the most technically ambitious mode in the suite. It's built for teams that need the same person, character, or product to appear consistently across many clips — without a traditional fine-tuning or LoRA workflow. You pass references in; the model extracts identity embeddings and locks them into the generation process.
The five-reference ceiling is the highest in the industry right now. You can mix image and video references freely within that budget, which means you can supply a front-facing photo, a side profile, two motion clips showing how the character moves, and an audio clip capturing their voice; the output holds all of those attributes simultaneously.
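In request terms, that reference budget might look something like the following. The references array, its type values, and the endpoint path are illustrative assumptions rather than the published schema.

```python
import requests

# Hypothetical R2V request mixing reference types within the five-slot budget.
payload = {
    "prompt": "The character walks through a rainy market, greeting a vendor",
    "references": [
        {"type": "image", "url": "https://example.com/refs/front.jpg"},
        {"type": "image", "url": "https://example.com/refs/profile.jpg"},
        {"type": "video", "url": "https://example.com/refs/walk_cycle.mp4"},
        {"type": "video", "url": "https://example.com/refs/gesture.mp4"},
        {"type": "audio", "url": "https://example.com/refs/voice.wav"},
    ],
}

resp = requests.post("https://api.example.com/wan2.7/reference-to-video",
                     json=payload,
                     headers={"Authorization": "Bearer YOUR_API_KEY"})
resp.raise_for_status()
print(resp.json().get("video_url"))
```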
For AI avatar production, digital influencer pipelines, or any project where character consistency has historically meant expensive per-subject fine-tuning, R2V changes the economics significantly.
Wan 2.7 spans a wide range of commercial applications. The combination of high-resolution output, character-stable R2V, and natural-language editing removes dependencies that previously required dedicated production crews or per-project model fine-tuning.
Wan 2.7 is built on a Diffusion Transformer (DiT) foundation combined with Flow Matching, the same architectural direction that has driven consistent scaling gains in both image and video generation over the past two years. Cross-attention handles text conditioning, while full spatio-temporal attention models motion dynamics across space and time in a single pass.
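For intuition, the flow-matching objective behind that pairing fits in a few lines. This is a generic rectified-flow-style sketch, not Wan 2.7's actual training code, and the model(x_t, t, text_emb) signature is an assumption about how the DiT is conditioned.

```python
import torch

def flow_matching_loss(model, x1, text_emb):
    """One conditional flow-matching step: regress the velocity that
    carries noise x0 toward data x1 along a straight path."""
    x0 = torch.randn_like(x1)                      # noise sample
    t = torch.rand(x1.shape[0], device=x1.device)  # per-sample time in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))
    x_t = (1 - t_) * x0 + t_ * x1                  # point on the linear path
    v_target = x1 - x0                             # constant velocity of that path
    v_pred = model(x_t, t, text_emb)               # DiT predicts the velocity field
    return torch.mean((v_pred - v_target) ** 2)
```

Training against straight noise-to-data paths is what makes the objective simple to optimize at scale, which is a large part of why DiT plus Flow Matching has become the default recipe for recent video generators.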