
ERNIE 4.5 VL empowers developers and businesses to build intelligent systems that seamlessly integrate visual and textual information.
ERNIE 4.5 VL is the vision-language branch of Baidu’s ERNIE 4.5 model family. It handles vision and language in a single framework, so it can analyze images, read text, and generate responses that draw on both. Typical applications include interactive visual Q&A, content generation, document interpretation, and visual reasoning. The model is open source under the Apache 2.0 license, which permits research, development, and commercial use.
ERNIE 4.5 VL pairs multimodal reasoning with flexible deployment. The architecture scales from an efficient 28B model to a high-capacity 424B model, and it supports long contexts for complex tasks. Its ecosystem of tools and frameworks speeds up development and integration.
The ERNIE 4.5 family spans dense and Mixture‑of‑Experts (MoE) models, with the VL branch focused on vision‑language tasks. Within VL, two key API‑relevant variants are:
Input: $0.1859 per 1M tokens
Output: $0.7436 per 1M tokens
Input: $0.5577 per 1M tokens
Output: $1.677 per 1M tokens
Input: $0.6435 per 1M tokens
Output: $1.859 per 1M tokens
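Because pricing is quoted per 1M tokens, the cost of a request is each token count multiplied by its rate and divided by one million. The sketch below shows that arithmetic using the three price pairs listed above. The tier labels (`tier_a`, `tier_b`, `tier_c`) and the function name `estimate_cost` are hypothetical placeholders, since the list does not say which model variant each pair belongs to.

```python
from decimal import Decimal

# USD per 1M tokens as (input, output), copied from the price list above.
# The tier names are hypothetical: the source does not label the pairs.
PRICES_PER_MILLION = {
    "tier_a": (Decimal("0.1859"), Decimal("0.7436")),
    "tier_b": (Decimal("0.5577"), Decimal("1.677")),
    "tier_c": (Decimal("0.6435"), Decimal("1.859")),
}


def estimate_cost(tier: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Return the estimated USD cost of one request at the given tier."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    input_price, output_price = PRICES_PER_MILLION[tier]
    # Scale each per-million rate by the tokens actually used.
    total = input_price * input_tokens + output_price * output_tokens
    return total / Decimal(1_000_000)


if __name__ == "__main__":
    # Example: a request with 200,000 input tokens and 50,000 output tokens.
    for tier in PRICES_PER_MILLION:
        print(tier, estimate_cost(tier, 200_000, 50_000))
```

`Decimal` is used instead of `float` so that small per-token amounts add up without binary rounding drift. Image inputs are assumed to be billed as tokens like text, which the price list does not confirm.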