
ERNIE 4.5 VL

ERNIE 4.5 VL is a series of vision‑language models (VLMs) built on Baidu’s ERNIE 4.5 multimodal MoE architecture, jointly trained on text and images for rich perception and reasoning.

ERNIE 4.5 VL empowers developers and businesses to build intelligent systems that seamlessly integrate visual and textual information.

What is ERNIE 4.5 VL API?

ERNIE 4.5 VL is part of Baidu’s ERNIE 4.5 family, a suite of multimodal models capable of interpreting complex data across multiple modalities. The VL variant specifically merges vision and language processing into a unified framework, allowing the model to analyze images, comprehend text, and generate detailed responses that connect both modalities. This model is suitable for applications ranging from interactive visual Q&A and content generation to document interpretation and visual reasoning. As an open-source solution under the Apache 2.0 license, it offers broad flexibility for research, development, and commercial applications.
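A typical way to call a vision-language model like this is through an OpenAI-compatible chat completions endpoint that accepts mixed text and image content parts. The sketch below only builds such a request payload; the endpoint URL and model identifier are placeholders, not confirmed by this page — check your provider's documentation for the actual values.

```python
import json

# Placeholder endpoint and model id -- replace with your provider's actual values.
API_URL = "https://example.com/v1/chat/completions"
MODEL = "ernie-4.5-vl-28b-a3b"  # assumed identifier, check provider docs

def build_vl_request(question: str, image_url: str, model: str = MODEL) -> dict:
    """Build an OpenAI-style chat payload that pairs a text question with an image."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }

# Example: ask a question grounded in a chart image.
payload = build_vl_request(
    "What trend does this chart show?",
    "https://example.com/chart.png",
)
print(json.dumps(payload, indent=2))
```

The payload would then be POSTed to the endpoint with your API key in an `Authorization` header; the response format follows whatever chat-completions schema your provider exposes.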

Why ERNIE 4.5 VL Stands Out

ERNIE 4.5 VL combines cutting-edge multimodal reasoning with practical deployment flexibility. It offers open-source accessibility under Apache 2.0, a scalable architecture that ranges from efficient 28B models to high-capacity 424B models, and long context windows (up to 123K tokens in the largest variant) for complex tasks. Its ecosystem includes tools and frameworks that accelerate development and integration, making it one of the most versatile vision-language solutions available today.

Model Variants

The ERNIE 4.5 family spans dense and Mixture‑of‑Experts (MoE) models, with the VL branch focused on vision‑language tasks. Within VL, three key API‑relevant variants are:

ERNIE 4.5 VL 28B A3B
  • Total parameters: ~28B
  • Activated parameters per token: ~3B
  • Architecture: Mixture‑of‑Experts (MoE) with heterogeneous multimodal design
  • Context length: 30K
  • Designed for efficient reasoning with image + text in a lighter compute footprint.

Pricing

Input: $0.1859 per 1M tokens

Output: $0.7436 per 1M tokens

ERNIE 4.5 VL 424B A47B
  • Total parameters: ~424B
  • Activated parameters per token: ~47B
  • Architecture: MoE with modality‑isolated routing and high‑capacity expert layers
  • Context length: 123K
  • Tailored for high‑precision multimodal reasoning at scale.

Pricing

Input: $0.5577 per 1M tokens

Output: $1.677 per 1M tokens

ERNIE 4.5 Turbo VL 32K
  • Total parameters: ~424B
  • Activated parameters per token: ~47B
  • Architecture: MoE with modality-isolated routing and high-capacity expert layers
  • Context length: 32K
  • Tailored for high-precision multimodal reasoning, combining image understanding with text generation for extended documents, complex visual Q&A, and large-scale multimodal workflows.

Pricing

Input: $0.6435 per 1M tokens

Output: $1.859 per 1M tokens
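Since all three variants are billed per million tokens, per-request costs can be estimated directly from the prices listed above. A minimal sketch (the dictionary keys are assumed identifiers chosen for this example, not official model ids):

```python
# Per-1M-token prices in USD, taken from the pricing listed above.
PRICES = {
    "ernie-4.5-vl-28b-a3b":   {"input": 0.1859, "output": 0.7436},
    "ernie-4.5-vl-424b-a47b": {"input": 0.5577, "output": 1.677},
    "ernie-4.5-turbo-vl-32k": {"input": 0.6435, "output": 1.859},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost for one request at the listed rates."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example: 50K input tokens + 2K output tokens on the 28B A3B variant.
cost = estimate_cost("ernie-4.5-vl-28b-a3b", 50_000, 2_000)
print(f"${cost:.6f}")  # prints "$0.010782"
```

Note that images are typically converted to tokens before billing, so actual input-token counts for multimodal requests depend on the provider's image tokenization.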

Key Features

Multimodal understanding

  • Joint processing of text prompts and images for grounded Q&A, captioning, retrieval-augmented reasoning, and visual explanation.
  • Robust document and chart analysis: extraction of structure, trends, and insights from PDFs, reports, infographics, and UI screenshots.

Advanced reasoning

  • Thinking mode with explicit multi-step reasoning over visual content, including zooming into regions, cross-referencing elements, and performing symbolic reasoning (e.g., math over charts).
  • Strong performance on STEM-style tasks that integrate equations, diagrams and textual descriptions.

Use Cases

Multimodal Q&A and assistants

  • Knowledge assistants answering questions based on uploaded images, diagrams, dashboards, or scanned documents.
  • Technical support bots that read UI screenshots, error dialogs, schematics or machine photos to guide troubleshooting and repair workflows.

Document and data analysis

  • Automated analysis of PDF reports, financial statements, contracts, and legal documents with embedded tables and figures.
  • Insight extraction from business dashboards and charts: summarizing trends, highlighting anomalies, and generating executive briefs grounded in the visual data.

E‑commerce and marketing

  • Product understanding from photos plus descriptions to generate detailed, attribute-rich listings and comparisons.
  • Visual A/B testing analysis: understanding ad creatives, infographics or landing page screenshots and relating them to performance metrics described in text.

Education, research and STEM

  • Step-by-step reasoning over textbook figures, lab experiment photos, plots, and math diagrams.
  • Interactive tutoring agents that combine text explanations with visual references and annotations over images.

Comparison with Other Models

vs Qwen2.5 VL

  • Performance focus: ERNIE 4.5 VL 28B A3B matches or exceeds Qwen2.5‑VL‑7B and Qwen2.5‑VL‑32B on many multimodal benchmarks while using fewer activated parameters, making it a strong choice where parameter efficiency matters.
  • Reasoning mode: ERNIE’s explicit thinking mode and heterogeneous MoE routing emphasize structured visual reasoning, whereas Qwen2.5‑VL lines are more conventional dense or smaller MoE designs without the same “thinking with images” paradigm.
