256K
0.26
1.495
Chat
Active

Step 3.7 Flash

A multimodal Mixture-of-Experts model from StepFun with native image and video understanding, optimized for fast inference, reasoning, and agent workflows with a 256K context window.
Step 3.7 FlashTechflow Logo - Techflow X Webflow Template

Step 3.7 Flash

Step 3.7 Flash is StepFun's multimodal MoE model with text, image and video understanding, a 256K context window, and selectable reasoning depth — optimized for fast inference, agentic workflows, and visual analysis.

What exactly is Step 3.7 Flash?

Step 3.7 Flash is StepFun's latest multimodal Mixture-of-Experts model, built for agentic coding, visual understanding, and long-context tasks. It pairs a 196B-parameter language backbone with a 1.8B Vision Transformer encoder for native image and video processing.

The MoE architecture activates approximately 11B parameters per token during inference — keeping compute cost close to an 11B dense model while maintaining a 198B total parameter budget. The model ships with selectable reasoning depth, tool use, and multimodal fusion as production defaults.

API Pricing

  • Input: $0.26 / 1M tokens
  • Output: $1.495 / 1M tokens

Architecture: what makes it fast and capable

Sparse MoE (mixture of experts)Each token is routed to a specialized subset of expert sub-networks within the 198B parameter space. With only ~11B parameters active per forward pass, the model achieves frontier-adjacent reasoning at a fraction of the compute cost of a comparable dense model.

Native multimodal fusionText, images, and video frames are processed in a single forward pass through a dedicated 1.8B Vision Transformer encoder. There are no separate vision adapters — the model was trained on multimodal tokens from scratch, enabling natural cross-modal reasoning.

Selectable reasoning depthCallers can set reasoning intensity to low, medium, or high per request. Low is optimized for speed and cost; high applies deeper chain-of-thought computation suited for complex coding, planning, and visual analysis tasks.

256K context windowPass entire codebases, document collections, long conversation histories, or multi-frame video sequences in a single request — no chunking required for most workloads.

Core capabilities

Text, image & video inputInclude screenshots, charts, documents, and short video clips alongside text in the same prompt. Natively understood — not approximated by a separate model.

Agentic coding workflowsScores 56.3% on SWE-Bench Pro and 76.5% on SWE-Bench Verified. Supports Advisor Mode — runs the full agentic loop and escalates to a larger model only at planning or failure recovery points, reaching 97% of Claude Opus 4.6 performance at ~1/9th the per-task cost.

Visual analysis95.3% on V* with Python Tool, 89.1% on HR-Bench 4K, 61.9% on Android Daily (beats Kimi K2.6 and GLM 5V Turbo). Handles high-resolution images, GUI inspection, bounding box analysis, and screenshot-based debugging.

Search and research71.7% on ResearchRubrics (vs GPT 5.5 at 61.5%), 92.8% on DeepSearchQA. Search is integrated into the reasoning loop — not a separate add-on.

Thinking preservationReasoning traces persist across turns, reducing redundant computation in iterative development and multi-step planning workflows.

Benchmark performance

Who should use Step 3.7 Flash?

Agent developersEngineers building multi-step coding agents that need consistent behavior across different scaffolds (KiloCode, OpenClaw, Claude Code, RooCode). Step 3.7 Flash narrows cross-harness variance to 64.5–71.5% — more predictable than most alternatives.

Visual AI product teamsApplications that process images, screenshots, charts, or video alongside text — from UI-aware agents to document intelligence pipelines — without managing a separate vision model.

Research and analysis teamsTeams running competitive analysis, literature reviews, or multi-source synthesis tasks where both text and visual inputs need to be processed together coherently.

Cost-conscious production teamsAt $0.26/M input, Step 3.7 Flash delivers multimodal reasoning and vision at a fraction of frontier model pricing. Advisor Mode brings SWE-Bench Verified performance close to Claude Opus 4.6 at $0.19/task vs $1.76/task.

Long-context workloadsLegal, finance, and research teams processing large documents or multi-frame video that exceed standard context windows. The 256K window eliminates most chunking pipelines entirely.

What exactly is Step 3.7 Flash?

Step 3.7 Flash is StepFun's latest multimodal Mixture-of-Experts model, built for agentic coding, visual understanding, and long-context tasks. It pairs a 196B-parameter language backbone with a 1.8B Vision Transformer encoder for native image and video processing.

The MoE architecture activates approximately 11B parameters per token during inference — keeping compute cost close to an 11B dense model while maintaining a 198B total parameter budget. The model ships with selectable reasoning depth, tool use, and multimodal fusion as production defaults.

API Pricing

  • Input: $0.26 / 1M tokens
  • Output: $1.495 / 1M tokens

Architecture: what makes it fast and capable

Sparse MoE (mixture of experts)Each token is routed to a specialized subset of expert sub-networks within the 198B parameter space. With only ~11B parameters active per forward pass, the model achieves frontier-adjacent reasoning at a fraction of the compute cost of a comparable dense model.

Native multimodal fusionText, images, and video frames are processed in a single forward pass through a dedicated 1.8B Vision Transformer encoder. There are no separate vision adapters — the model was trained on multimodal tokens from scratch, enabling natural cross-modal reasoning.

Selectable reasoning depthCallers can set reasoning intensity to low, medium, or high per request. Low is optimized for speed and cost; high applies deeper chain-of-thought computation suited for complex coding, planning, and visual analysis tasks.

256K context windowPass entire codebases, document collections, long conversation histories, or multi-frame video sequences in a single request — no chunking required for most workloads.

Core capabilities

Text, image & video inputInclude screenshots, charts, documents, and short video clips alongside text in the same prompt. Natively understood — not approximated by a separate model.

Agentic coding workflowsScores 56.3% on SWE-Bench Pro and 76.5% on SWE-Bench Verified. Supports Advisor Mode — runs the full agentic loop and escalates to a larger model only at planning or failure recovery points, reaching 97% of Claude Opus 4.6 performance at ~1/9th the per-task cost.

Visual analysis95.3% on V* with Python Tool, 89.1% on HR-Bench 4K, 61.9% on Android Daily (beats Kimi K2.6 and GLM 5V Turbo). Handles high-resolution images, GUI inspection, bounding box analysis, and screenshot-based debugging.

Search and research71.7% on ResearchRubrics (vs GPT 5.5 at 61.5%), 92.8% on DeepSearchQA. Search is integrated into the reasoning loop — not a separate add-on.

Thinking preservationReasoning traces persist across turns, reducing redundant computation in iterative development and multi-step planning workflows.

Benchmark performance

Who should use Step 3.7 Flash?

Agent developersEngineers building multi-step coding agents that need consistent behavior across different scaffolds (KiloCode, OpenClaw, Claude Code, RooCode). Step 3.7 Flash narrows cross-harness variance to 64.5–71.5% — more predictable than most alternatives.

Visual AI product teamsApplications that process images, screenshots, charts, or video alongside text — from UI-aware agents to document intelligence pipelines — without managing a separate vision model.

Research and analysis teamsTeams running competitive analysis, literature reviews, or multi-source synthesis tasks where both text and visual inputs need to be processed together coherently.

Cost-conscious production teamsAt $0.26/M input, Step 3.7 Flash delivers multimodal reasoning and vision at a fraction of frontier model pricing. Advisor Mode brings SWE-Bench Verified performance close to Claude Opus 4.6 at $0.19/task vs $1.76/task.

Long-context workloadsLegal, finance, and research teams processing large documents or multi-frame video that exceed standard context windows. The 256K window eliminates most chunking pipelines entirely.

Try it now

500+ AI Models

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.

The Best Growth Choice
for Enterprise

Get API Key
Testimonials

Our Clients' Voices