Vision tests
Llama 3.2 90B
Vs
OpenAI GPT-4o

Comparison of LLama 3.2 90B Vision and GPT-4o Vision, two advanced multimodal AI models for visual and language tasks.

Introduction

Recent advancements in multimodal AI have introduced two powerful models: LLama 3.2 90B Vision and GPT-4o Vision. LLama 3.2 90B Vision excels in complex visual reasoning and high-dimensional data analysis, leveraging its 90 billion parameters for intricate computational tasks. GPT-4o Vision, meanwhile, focuses on integrating robust language and visual processing capabilities, making it ideal for cross-modal applications and context-aware output generation.

Architecture: LLama 3.2 90B Vision uses a transformer-based architecture optimized for visual data, while GPT-4o Vision employs a multi-modal transformer design.

In this article, we’ll compare their features and performance to determine which model is best suited for various use cases.

Benchmarks and specs

Specs

This is a specification comparison of two advanced multimodal AI models, LLama 3.2 90B Vision and GPT-4o Vision

Specification	LLama 3.2 90B Vision	GPT-4o Vision
Input modalities	Text + Image	Text + Image + Audio + Video
Output modalities	Text	Text
Input Context Window	128K tokens	128K tokens
Maximum Output Tokens	-	16K tokens
Number of parameters in the LLM	90B	175B
Knowledge cutoff	December 2023	October 2023
Release Date	September 25, 2024	May 13, 2024
Output tokens per second	47.5	111

The main differences between LLama 3.2 90B Vision and GPT-4o Vision lie in their input modalities and processing speeds. LLama 3.2 90B Vision supports text and image inputs, while GPT-4o Vision extends to text, image, audio, and video. With 175 billion parameters, GPT-4o Vision outstrips LLama's 90 billion and processes tokens faster at 111 tokens per second compared to LLama’s 47.5. Both models share an input context window of 128K tokens, but GPT-4o Vision supports up to 16K output tokens. Their knowledge cutoffs are December 2023 for LLama and October 2023 for GPT-4o.

These distinctions highlight GPT-4o Vision's broader applicability and enhanced performance, particularly for tasks requiring diverse input types and rapid processing.

Benchmarks

This benchmark is the combination of official release notes for both models (LLama 3.2 90B Vision and GPT-4o Vision), as well as multiple open benchmarks.

Benchmark	LLama 3.2 90B Vision	GPT-4o Vision
Massive Multi-discipline Multimodal Understanding MMMU	60.3	69.1
Chart question answering ChartQA	85.5	85.7
Visual question answering AI2 diagram	91.1	94.8
Document Visual Question Answering DocVQA	90.1	88.4
Math Reasoning in Visual Contexts MathVista	57.3	63.8

The benchmark results reveal a nuanced performance comparison between GPT-4o Vision and LLama 3.2 90B Vision.

GPT-4o Vision demonstrates superior capabilities in several critical domains:

Multimodal Understanding (MMMU): GPT-4o Vision scores 69.1, significantly outperforming LLama's 60.3.
Visual Question Answering (AI2 diagram): GPT-4o Vision achieves 94.8, edging out LLama's 91.1.
Math Reasoning in Visual Contexts (MathVista): GPT-4o Vision shows a clear advantage with 63.8 compared to LLama's 57.3.

However, LLama 3.2 90B Vision maintains its competitive edge in certain areas:

Document Visual Question Answering (DocVQA): LLama excels with 90.1, surpassing GPT-4o Vision's 88.4.
Chart Question Answering (ChartQA): Both models perform nearly identically, with LLama at 85.5 and GPT-4o Vision at 85.7.

These results suggest that while GPT-4o Vision offers superior performance in tasks requiring complex multimodal integration and visual reasoning, LLama 3.2 90B Vision remains highly competitive, particularly in document analysis and chart interpretation.

‍

AI Playground

Practical tests

1. Text from image

Write the text from the image and say where it came from?

‍

"The Hobbit" by J.R.R. Tolkien. Chapter III

‍

LLama 3.2 90B Vision output:

The text from the image is from Chapter III of The Silver Chair by C.S. Lewis.
The book is a fantasy novel and the third book in The Chronicles of Narnia series.

Here is the full text of the page:

**Chapter III**

**A SHORT REST**

They did not sing or tell stories that day, even though the weather improved;
nor the next day, nor the day after. They had begun to feel that danger was 
not far away on either side. They camped under the stars, and their horses 
had more to eat than they had; for there was plenty of grass, but there was
not much in their bags, even with what they had got from the trolls. One 
morning they forded a river at a wide shallow place, full of the noise of 
stones and foam. The far bank was steep and slippery. When they got to the 
top of it, leading their ponies, they saw that the great mountains had marched
down very near to them. Already they seemed only a day's easy journey from the
feet of the nearest. Dark and drear it looked, though there were patches of sunlight
on its brown sides, and behind its shoulders the tips of snow-peaks gleamed.

"Is that The Mountain?" asked Bilbo in a solemn voice, looking at it with
round eyes. He had never seen a thing that looked so big before.

Correct Answer ❌
The model incorrectly identified the source of this passage and failed to provide the complete text.

GPT-4o Vision output:

The image contains text from "The Hobbit" by J.R.R. Tolkien. Here is the text:
---
Chapter III
A SHORT REST

They did not sing or tell stories that day, even though the weather improved;
nor the next day, nor the day after. They had begun to feel that danger was 
not far away on either side. They camped under the stars, and their horses had
more to eat than they had; for there was plenty of grass, but there was not much
in their bags, even with what they had got from the trolls. One morning they
forded a river at a wide shallow place full of the noise of stones and foam. 
The far bank was steep and slippery. When they got to the top of it, leading 
their ponies, they saw that the great mountains had marched down very near to them.
Already they seemed only a day's easy journey from the feet of the nearest.
Dark and drear it looked, though there were patches of sunlight on its brown sides,
and behind its shoulders the tips of snow-peaks gleamed.

"Is that The Mountain?" asked Bilbo in a solemn voice, looking at it with round eyes.
He had never seen a thing that looked so big before.

"Of course not!" said Balin. "That is only the beginning of the Misty Mountains,
and we have got to get through, or over, or under those somehow, before we can come
into Wilderland beyond. And it is a deal of a way even from the other side of them 
to the Lonely Mountain in the East where Smaug lies on our treasure."

Correct Answer ✅
The model has successfully completed the task in its entirety.

‍

2. Venn diagram:

What is this diagram about?

LLama 3.2 90B Vision output:

**Diagram Overview**

The diagram presents a Venn diagram with overlapping circles, each representing a distinct concept.
The center circle is labeled **"Product"**, while the outer circles are labeled **"Business"**,
**"Technology"**, and **"Human"**.

**Circle Labels**

*   **Business**
*   **Technology**
*   **Human**

**Venn Diagram Analysis**

The Venn diagram illustrates the intersection of these concepts, highlighting the relationships and
overlaps between them. The diagram suggests that a product is not just about one aspect but rather a
combination of multiple factors.

**Interpretation**

This diagram can be interpreted in various ways, depending on the context in which it is used. 
One possible interpretation is that a successful product requires a balance between business, technology,
and human elements. For instance, a product may need to be valuable, feasible, and usable to be considered
successful.

**Conclusion**

In conclusion, the diagram provides a visual representation of the interconnectedness of business,
technology, and human factors in creating a successful product. It highlights the importance of considering
multiple aspects when developing a product and emphasizes the need for balance and harmony among these
elements.

Correct Answers: ✅
‍
- Detailed description of the diagram
- Offers broad interpretations
- Misses specific labels (valuable, feasible, usable)
- Somewhat repetitive and wordy

GPT-4o Vision output:

This diagram is a Venn diagram illustrating the intersection of three essential aspects in product
development: Business, Technology, and Human. It emphasizes that a successful product should be
valuable (Business), feasible (Technology), and usable (Human). The central overlap indicates a 
product that integrates all three aspects effectively.

Correct Answer ✅
‍
- Concise and to-the-point
- Correctly identifies specific labels (valuable, feasible, usable)
- Clearly explains the central overlap
- Less detailed on diagram structure
- Fewer broader implications

‍

3. United States GDP Chart:

What does this chart tell you? Guess why there was a decline

LLama 3.2 90B Vision output:

This chart shows the growth rate from 1985 to 2017 and a moving average over ten years. 
The growth rate has declined in different years due to various factors. Here are some 
possible reasons for the decline:

*   **1991:** The growth rate declined in 1991 due to the Gulf War, which led to a
decrease in oil prices and a recession in many countries.
*   **2001:** The growth rate declined in 2001 due to the dot-com bubble burst, which
led to a decline in technology stocks and a recession in the United States.
*   **2009:** The growth rate declined in 2009 due to the global financial crisis, which
led to a decline in credit availability and a recession in many countries.
*   **2015-2016:** The growth rate declined in 2015-2016 due to the decline in oil prices
and the slowdown in China's economy.

Overall, the growth rate has been declining in different years due to various factors,
including global economic trends, political events, and technological changes.

Correct Answers: ✅‍
LLama 3.2 90B Vision provides specific years of decline (1991, 2001, 2009, 2015-2016) and reasons for each. It mentions the 10-year moving average but includes unnecessary detail about 2015-2016.

GPT-4o Vision output:

This chart shows the growth rate over time, from 1985 to 2017, and includes a 10-year moving
average trend line. Generally, there are fluctuations in the growth rate, with some years
experiencing significant drops.
Here are some potential reasons for the decline in growth rate during certain periods:
1. **Early 1990s Recession**: Economic downturns in the early 1990s could have caused the drop
around 1991.
2. **Dot-Com Bubble Burst (2000-2002)**: The decline around 2001-2002 may be linked to the
bursting of the dot-com bubble, leading to an economic slowdown.
3. **Global Financial Crisis (2008)**: The significant downturn around 2008 coincides with
the global financial crisis, which severely impacted economies worldwide.
The moving average smooths out these fluctuations, showing an overall trend of slower growth
over time. Other variables like economic policies, technological changes, geopolitical events,
and demographic shifts could also influence these trends

Correct Answer ✅
GPT-4o Vision identifies key periods of decline (early 1990s, 2000-2002, 2008) with reasons. It mentions the 10-year moving average, notes the overall trend of slower growth, and suggests additional factors that could influence trends

‍

4. Maths prompt:

Find angle 3.

LLama 3.2 90B Vision output:

**Step 1: Identify the given angles**

*   Angle 1 = 52 degrees
*   Angle 2 = 38 degrees

**Step 2: Find the sum of interior angles in a triangle**

*   Sum of interior angles = 180 degrees

**Step 3: Calculate angle 3**

*   Angle 3 = 180 - (angle 1 + angle 2)
*   Angle 3 = 180 - (52 + 38)
*   Angle 3 = 180 - 90
*   Angle 3 = 90 degrees

**Answer:** 90 degrees

Correct Answers: ❌
‍

GPT-4o Vision output:

In the given diagram, you need to find angle 3. Here's how we can approach it step by step:
Rectangle TXWY
∠TYZ = 52°

Step-by-Step Solution
In rectangle TXWY:
∠TYW = 90° (right angle property)
In right triangle TYZ:
∠YTZ = 90° - ∠TYZ = 90° - 52° = 38°
Properties of triangles TYZ and TWZ:

∠TYZ = 52°
∠YTZ = 38°
∠WZY = 90°

In triangle TWZ:
∠TWZ = 90° (given TW ∥ XY and WZ ⊥ TW)
Parallel lines property:
∠TWZ = ∠YZX = 90°
Conclusion:
∠3 = ∠WTZ = ∠YTZ = 38°
(Note: ∠4 = ∠ZYW = ∠TYZ = 52°)

Therefore, angle 3 is 38°.

Correct Answer ✅

‍

5. Image recognition

There are five zebras shown in the picture

What is shown in the picture? Determine their number

LLama 3.2 90B Vision output:

There are five zebras shown in the picture.

Correct Answers: ✅‍

GPT-4o Vision output:

There are 6 zebras drinking from a water source in the image.

Correct Answer ❌
‍
Both models identified the animal, but GPT-4o incorrectly counted its number.

‍

6. Code from sketch

The image shows a sketch for a website. It has 4 sections: 
1) Header with text on the left and an image on the right
2) Menu section (Home, Services, About, Contact) - elements must be centered
3) Content - use Lorem Ipsum placeholder for the text
4) Content with images (each row contains 2 elements - image and text) - use 
Lorem Ipsum placeholder for the text. 
For images, use this link: https://via.placeholder.com/400x300. 
Write the code for it.

LLama 3.2 90B Vision output:

The image shows a sketch for a website. It has 4 sections:

Correct Answers: ✅ (4/5)

GPT-4o Vision output:

Correct Answers: ✅ (5/5)
‍
Both models captured the concept of the sketch well, but Llama did not center the navigation elements.

‍

7. Cells under a microscope

What is shown in the image? Identify the class, genus and domain

LLama 3.2 90B Vision output:

The image shows a microscopic view of a cell. The cell is oval-shaped, with a smooth outer
membrane and a granular interior. There are several small, dark spots scattered throughout
the cell, which may be vacuoles or other organelles.

To identify the class, genus, and domain of this cell, we need to consider its characteristics:


*   **Domain:** This cell is likely a eukaryote, as it has a true nucleus and other membrane-bound
organelles.
*   **Class:** Based on its shape and size, this cell is probably a protozoan. Protozoa are
single-celled organisms that are typically motile and have complex structures.
*   **Genus:** Without more information, it is difficult to determine the specific genus of
this cell. However, some common genera of protozoa include Paramecium, Amoeba, and Euglena.

Therefore, based on the characteristics visible in the image, the class, genus, and domain of
this cell are:

*   **Domain:** Eukarya
*   **Class:** Protozoa
*   **Genus:** Unknown

Correct Answers: ❌‍

GPT-4o Vision output:

The image shows a *Paramecium*, a type of single-celled organism. Here is its classification:
- **Domain**: Eukaryota
- **Genus**: Paramecium
- **Class**: Oligohymenophorea
Paramecia are often studied in biology due to their easily observable cellular processes.

Correct Answer ✅
‍
Both models identified the animal, but GPT-4o incorrectly counted its number.

‍

Get API Key

Conclusion

In conclusion, while both LLama 3.2 90B Vision and GPT-4o Vision are strong contenders in the multimodal AI space, their strengths cater to different use cases. GPT-4o Vision, with its higher parameter count and faster token processing speed, outperforms in complex multimodal tasks and visual question answering.

However, LLama 3.2 90B Vision demonstrates superior performance in document visual question answering and offers a more economical pricing structure, making it a viable option for projects with a tighter budget. Ultimately, the choice between these models depends on the specific requirements of the task at hand, whether it’s precision in visual analysis, speed, or cost efficiency. You can even consider a faster and cheaper model with 11B parameters Llama 3.2 11B Vision.

‍

You can access both LLama 3.2 90B Vision and the latest snapshot of ChatGPT-4o API, or see our full model lineup here - try for yourself, and get a feel for the frontier AI power!

Get API Key