

The Seedance 1.5 Pro API supports multi-modal inputs for seamless integration of visuals, audio, and text.
Most AI video tools operate in two steps: generate the visual, then stitch in audio afterward. That two-step process is why so much AI-generated content looks and sounds slightly off: sound effects land a beat late, lips don't quite match words, and ambient noise feels pasted in.
Seedance 1.5 Pro takes a different approach entirely. Developed by ByteDance's Seed team and published in December 2025, it is a foundation model built from the ground up for native, joint audio-video generation. Audio and video aren't added to each other; they're created together, sharing the same generation process, the same attention layers, and the same loss functions.
The result is millisecond-level synchronization between what you see and what you hear: lips that move in precise time with spoken words, ambient sounds that materialize exactly when objects collide on screen, background music that breathes with the pacing of the shot.
Six capabilities that set this model apart from every other video generation API on the market today.
Ambient sounds, action effects, background music, and human voices are generated simultaneously with the video frames, not appended afterward. The dual-branch Diffusion Transformer processes both modalities in parallel, synchronized at the architectural level.
The model understands phonemes (the individual sounds that make up speech) and maps them to the correct lip shapes in real time. This works across English, Mandarin, Japanese, Korean, Spanish, Indonesian, Cantonese, Shanxi dialect, and Sichuan dialect, with each language's natural rhythm preserved.
Specify professional camera movements directly in your prompt: dolly zooms, Hitchcock effects, crane movements, tracking shots, whip pans, and orbits. The model also processes compositional language (golden hour lighting, rack focus, shallow depth of field) and executes it accurately across the generated clip.
Subtle micro-expressions (a slight swallow, eyes widening, anxiety transitioning to relief) are rendered accurately based on the prompt context and image input. This removes the mechanical stiffness common in AI video. Characters behave, not just move.
When generating multiple clips for the same narrative, the model preserves character identity: faces don't morph, clothing stays consistent, and proportions remain stable even during complex movements or across full 12-second clips. Provide a reference image as an anchor to lock appearance across an entire sequence.
Through multi-stage distillation and quantization, ByteDance achieved a 10x speedup in inference over the base model. What once took 20–30 minutes now takes 2–3 minutes without meaningful quality loss — fast enough for real products, not just demos.
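To make the prompt-level control described above concrete, here is a minimal request sketch in Python. The endpoint URL, model identifier, and duration field are assumptions for illustration; only the prompt, image_url, aspect_ratio, and audio parameters are referenced elsewhere in this article, so check the official API documentation for the exact names.

import requests

API_KEY = "YOUR_API_KEY"  # placeholder credential

payload = {
    "model": "seedance-1.5-pro",  # assumed model identifier
    "prompt": (
        "Slow dolly-in on a young chef in a rainy night market, golden hour "
        "spill from a noodle stall, shallow depth of field. She exhales, eyes "
        "widening with relief, and says: 'It finally works.' Ambient rain, a "
        "sizzling wok, soft background music."
    ),
    "image_url": "https://example.com/chef-reference.jpg",  # optional identity anchor
    "aspect_ratio": "9:16",
    "duration": 10,   # seconds; assumed parameter name
    "audio": True,    # native audio generation on (the default, per the FAQ below)
}

resp = requests.post(
    "https://api.example.com/v1/video/generations",  # placeholder endpoint
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=300,
)
resp.raise_for_status()
print(resp.json())

Note how camera language, lighting, an emotional beat, spoken dialogue, and sound cues can all live in a single prompt string.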
Understanding the architecture helps you write better prompts and predict model behavior. Here's what's actually happening when you make an API call.
The core is a 4.5 billion parameter Dual-Branch Diffusion Transformer. Two parallel branches — one for video frames, one for audio waveforms — run concurrently and share information through cross-modal attention fusion modules. Because both branches see each other's representations during generation, they stay in lock-step from the very first denoising step.
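The exact Seedance architecture is not public, so the following is only a conceptual sketch of the mechanism described above: two branches that each attend over their own tokens, then cross-attend to the other modality so video and audio stay coupled at every denoising step. Module names, shapes, and dimensions are illustrative assumptions, not the real implementation.

import torch
import torch.nn as nn

class CrossModalFusionBlock(nn.Module):
    """Toy dual-branch block: video and audio tokens exchange information."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        # One self-attention per branch, plus cross-attention in each direction.
        self.video_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_from_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_from_video = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_video = nn.LayerNorm(dim)
        self.norm_audio = nn.LayerNorm(dim)

    def forward(self, video_tokens, audio_tokens):
        # Each branch first attends over its own tokens...
        v, _ = self.video_self(video_tokens, video_tokens, video_tokens)
        a, _ = self.audio_self(audio_tokens, audio_tokens, audio_tokens)
        # ...then queries the other branch, so frames "see" the audio being
        # denoised at the same step, and vice versa.
        v_fused, _ = self.video_from_audio(v, a, a)
        a_fused, _ = self.audio_from_video(a, v, v)
        return self.norm_video(v + v_fused), self.norm_audio(a + a_fused)

# Toy shapes: a batch of patchified video latents and audio latent frames.
block = CrossModalFusionBlock()
video = torch.randn(1, 128, 512)
audio = torch.randn(1, 256, 512)
video_out, audio_out = block(video, audio)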
The model was trained on mixed-modal datasets using curriculum-based data scheduling, robust captioning, and semantic enrichment. Pre-training covers text-to-audio-video (T2VA), image-to-audio-video (I2VA), and unimodal tasks (T2V, I2V). This multi-task approach means a single model handles all input modes without switching contexts between API calls.
After pre-training, the team ran Supervised Fine-Tuning on curated high-quality data, followed by Reinforcement Learning from Human Feedback with multi-dimensional reward models calibrated for audio-visual contexts — not just visual preference signals. This is why the model follows complex narrative prompts reliably, rather than generating visually attractive but semantically incoherent clips.
Performance was measured using SeedVideoBench-1.5, an internally developed benchmark covering both the video stream (subjects, motion, interaction, cinematography) and the audio stream (vocal types, non-speech audio properties, synchronization). Evaluation uses both a 5-point Likert scale and pairwise Good-Same-Bad metrics for subjective quality — the same methodology used for professional production content review.
These are the real-world categories where native audio-video generation creates the most concrete value.
Generate TikTok- and Reels-format content in 9:16 at scale. With character consistency and natural dialogue, content teams can produce multi-episode virtual creator series without actors or studios.
Create the same product demo in English, Japanese, Mandarin, and Spanish from a single source image, with native lip-sync in each language. One key visual, four market-ready cuts, no localization agency.
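A localization workflow like this could be scripted as a simple loop, sketched below. The endpoint, model name, product name, and the idea of steering the spoken language purely through the dialogue text in the prompt are all assumptions; verify the exact parameters against the official API reference.

import requests

API_KEY = "YOUR_API_KEY"
SOURCE_IMAGE = "https://example.com/product-key-visual.jpg"  # hypothetical key visual

lines = {
    "English": "Meet the new AeroBrew. Barista-level coffee in ninety seconds.",
    "Japanese": "新しいAeroBrew。90秒でバリスタ級のコーヒーを。",
    "Mandarin": "全新AeroBrew，九十秒冲出咖啡师级别的咖啡。",
    "Spanish": "Conoce la nueva AeroBrew. Café de barista en noventa segundos.",
}

for language, line in lines.items():
    payload = {
        "model": "seedance-1.5-pro",  # assumed model identifier
        "prompt": (
            f"Product demo, slow orbit around a coffee maker, steam rising. "
            f"A presenter speaks to camera in {language}: \"{line}\""
        ),
        "image_url": SOURCE_IMAGE,   # the same key visual anchors every cut
        "aspect_ratio": "9:16",
        "audio": True,
    }
    resp = requests.post(
        "https://api.example.com/v1/video/generations",  # placeholder endpoint
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=payload,
        timeout=300,
    )
    print(language, resp.status_code)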
Generate storyboard animatics with camera movements and emotionally expressive characters to communicate director intent to crews before principal photography — faster and cheaper than traditional animatics.
Animate product images into short cinematic demonstrations with voiceover and ambient environment sounds. A still photo of a coffee maker can become a 10-second atmospheric clip with steam, pouring sounds, and narration.
Build interactive story experiences where player choices generate new video clips in real time, each with synchronized dialogue, effects, and music. Game cutscenes without a cutscene budget.
Seedance 1.5 Pro's RLHF training specifically targeted advertising, micro-dramas, and narrative content. Short emotional arcs, dialogue-heavy scenes, and brand voice all come through with production coherence.
The lip-sync capabilities of Seedance 1.5 Pro go deeper than simple language detection. The model was trained on phoneme-level data across each of its supported languages and dialects, meaning it doesn't just move lips; it forms the correct lip shapes for the actual sounds being made in that language's phonological system.
This is particularly visible in dialect support, where standard Mandarin and Sichuan dialect have genuinely different phoneme distributions. The model handles both distinctly, not as variants of the same thing.
Does Seedance 1.5 Pro always generate audio, or can I get silent video?
Both modes are supported. Set audio: false in your request for a silent clip at the same quality. Audio generation is on by default since it's a core differentiator, but disabling it does not affect video quality and slightly reduces generation time.
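As a minimal sketch, assuming the audio field from this answer and a placeholder model name, a silent-clip request body might look like:

payload = {
    "model": "seedance-1.5-pro",   # placeholder model identifier
    "prompt": "A paper airplane gliding through an empty office at dusk.",
    "audio": False,                # request a silent clip; omit or set True for native audio
}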
How does the character consistency feature work in practice?
Pass a reference image in the image_url field alongside your prompt. The model uses this as an anchor for face, clothing, and style. Across multiple calls with the same reference image, character identity is preserved even when camera angles, lighting, and actions vary substantially between shots.
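A sketch of that workflow, assuming the image_url field from this answer and a placeholder model name: three shot prompts that share one reference image as the identity anchor.

reference = "https://example.com/character-reference.jpg"  # hypothetical anchor image

shots = [
    "Wide tracking shot: she crosses a rain-soaked plaza at night.",
    "Close-up: she pauses, swallows, and glances over her shoulder.",
    "Low-angle crane move: she pushes open a heavy door and steps inside.",
]

payloads = [
    {
        "model": "seedance-1.5-pro",   # placeholder model identifier
        "prompt": shot,
        "image_url": reference,        # same anchor image locks face, clothing, and style
        "aspect_ratio": "16:9",
    }
    for shot in shots
]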
What resolution should I target for social media vertical video?
For TikTok, Instagram Reels, and YouTube Shorts, use aspect_ratio: "9:16" at 720p for the best speed-to-quality tradeoff at scale, or 1080p for hero content where quality justifies the extra generation time. The 9:16 aspect ratio is natively supported — no cropping or letterboxing artifacts.
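Sketched as request settings, with the caveat that the resolution field name is an assumption (only aspect_ratio is named in this answer):

PRESETS = {
    "draft": {"aspect_ratio": "9:16", "resolution": "720p"},    # fast, high-volume work
    "hero":  {"aspect_ratio": "9:16", "resolution": "1080p"},   # slower, flagship content
}

payload = {
    "model": "seedance-1.5-pro",   # placeholder model identifier
    "prompt": "A street dancer freezes mid-spin as neon signs flicker on around her.",
    **PRESETS["draft"],
}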