Llama 3.1 405B
VS
ChatGPT-4o

The battle on the very frontier of LLM.
Can the strongest open-source model rival ChatGPT-4o?

Benchmarks and specs

Specs

The competition between language models is intense, and two of the main competitors are LLama 3.1 405B and ChatGPT-4o. The models have clashed in the past, with LLama 70B being the tool of choice for many AI devs, although this iteration of models certainly stole even more spotlight from OpenAI. Let's dive into the details, and see where the details lie.

Specification	Llama 3.1 405B	ChatGPT-4o
Сontext Window	128K	128K
Output Tokens	4K	16K
Number of parameters in the LLM	405B	unknown
Knowledge cutoff	December 2023	October 2023
Release Date	July 23, 2024	August 6, 2024
Tokens per second	~29.5	~103

We will be accessing the newest GPT-4o API, updated in August for higher context window. Llama 3.1 405B API returns us only 29.5 t/s, which definitely puts it closer to Claude Opus in its speed and, hopefully, perfomance.

Benchmarks

This benchmark is comprised of official release notes and multiple open-source benchmarks.

Benchmark	Llama 3.1 405B	ChatGPT-4o
Undergraduate level knowledge MMLU	88.6	88.7
Graduate level reasoning GPQA	51.1	53.6
Code Human Eval	89.0	90.2
Math problem-solving MATH	73.8	70.2
Multilingual Math MGSM	91.6	90.5
Reasoning DROP, F1	84.8	83.4

Both models perform well with no big surprises. GPT-4o has seen quite a few updates since then, so it is probably even better - yet the tests are always needed for a specific usecase.

Get API Key

Practical tests

We will be going through some known AI model weakspots, to weed out the problems, and highlight the strong sides. At the end - we'll give you a python snippet that allows comparison of models side-by-side, so that you can lead your own tests for your usecase.
.
Some of the outputs will be edited down for shortness or clarity.

Prompt 1. Just Chatting:

Make 10 sentences. Each sentence must contain exactly 7 words.

‍

Llama 3.1 405B output

GPT-4o output

1. The sun shines brightly in clear skies. ✅
2. Beautiful flowers bloom in the warm weather. ✅
3. Children laugh and play outside all day. ✅
4. The big brown dog wags its tail. ✅
5. The baby smiles at her happy mother. ✅
6. The tall trees sway gently in breeze. ✅
7. The bright blue ocean sparkles with waves. ✅
8. The little girl reads a big book. ✅
9. The happy couple walks on the beach. ✅
10. The soft music plays on the radio. ✅

1. The sun set behind the distant hills. ✅
2. She baked a cake for her friend. ✅
3. The cat chased the mouse all day. ✅
4. Rain poured heavily on the city streets. ✅
5. He forgot his keys on the kitchen counter. ❌
6. The flowers bloomed beautifully in the garden. ✅
7. They watched a movie under the starry sky. ❌
8. The teacher explained the lesson very clearly.✅
9. Birds sang sweetly in the early morning. ✅
10. A gentle breeze rustled through the leaves. ✅

Correct Answers: 10/10 ✅
A good performance, unsurprisingly.

Correct Answers: 8/10❌
Good performance, although doesn't consider "the" as a word.

Prompt 2. Language comprehension:

Read the following sentence and answer the questions:
"The news hit him like a ton of bricks, leaving him stunned
and speechless."

Questions:
What figure of speech is used in this sentence?
What is the effect of this figure of speech on the
reader's understanding of the character's reaction?
How does the phrase "like a ton of bricks" enhance
the description of the character's emotions?

Llama 3.1 405B output

GPT-4o output

Here are the answers:
1. What figure of speech is used in 
this sentence?

The figure of speech used is a simile, 
comparing the impact of the news to a 
physical object (a ton of bricks) 
to convey intensity.
...

1. The figure of speech used is a 
simile.

2. The simile emphasizes the sudden 
and overwhelming impact of the news, 
making the reader understand the 
character's intense shock and inability 
to respond.

...

Correct Answer: ✅
All 3 questions answered correctly. Edited down to remain short.
Reproducable with our Llama 3.1 405B API

Correct Answer: ✅
Completely correct, too.

Prompt 3. Logical reasoning:

You have 4 inflated balloons. 
One of them flew away and the other one deflated. 
Then you bought two more, put one in a drawer, 
and one in the wardrobe one floor below. 

How many balloons do you have left?

Llama 3.1 405B output

GPT-4o output

What a thrilling balloon adventure! 
...
1. One of them flew away, 
leaving you with 4 - 1 = 3 balloons.
2. Another one deflated, 
but it's still a balloon ...
3. You bought 2 more balloons,
bringing the total to .... 5 balloons.
4. You put one of the new balloons 
in a drawer, ... still in your possession.
5. You put the other new balloon 
in the wardrobe one floor below, 
which again means it's still yours, 
but stored in a different location.

So, to summarize:
you have a total of 5 balloons

1. You initially have 
4 inflated balloons.
2. One balloon flies away, 
leaving you with 3 balloons.
3. Another balloon deflates, 
leaving you with 2 balloons.
4. You buy 2 more balloons, 
bringing your total to 4 balloons.
5. You put one balloon in a drawer, 
and another in a wardrobe 
one floor below.

Regardless of their location, 
... you have 4 balloons left.

Correct Answer: ✅
Good understnding of nuance! We edited down all the explanations down to the bone, but it still shows.

Correct Answer: ✅
Correct conclusion with clear, step-by-step reasoning

Prompt 4. Math:

The height of the cone of maximum volume inscribed in a sphere of radius R is ...

Llama 3.1 405B output

GPT-4o output

We want to maximize the volume of 
the cone, which is given by:
V = (1/3)πr²h
The radius of the cone is related
to the radius of the sphere by:
r = R sin(θ)
....
The height of the cone of maximum 
volume inscribed in a sphere 
of radius R is:
h = (4/3)R

The apex of the cone is at the center
of the sphere, and the base of the cone 
is parallel to the plane of the equator 
of the sphere.

Using the Pythagorean theorem in the
right triangle formed by the radius 
of the sphere, the radius of the 
cone's base.
...

To find the maximum volume, 
take the derivative of V with respect
to h and set it to zero:
...
h = \frac{2R}{\sqrt{3}}
Thus, ... 
\( R \) is \( \frac{2R}{\sqrt{3}} \).

Correct Answer: ✅
Quick solution, and a correct one.

Correct Answer: ❌
Incorrect conclusion. Good start, but a mistake in reasoning

Prompt 5. Let's get coding:

Create a simple, colorful Arkanoid game using Python and the Pygame library. 
The game should be implemented in a single module with the following features:

1. A colorful graphical interface with a resolution of 800x600 pixels. 
2. A paddle that the player can move left and right using arrow keys. 
...
10. Implement a simple "Game Over" screen

Provide the complete, well-commented code ...

Llama 3.1 405B output

GPT-4o output

snake game with a score counter, programmed by gpt 4o

snake game, programmed by chat sonnet 3.5 with a counter, pause and a difficulty menu

Correct Answer: ✅
Not a bad result. The Llama 405B model sometimes gets stuck on the loss screen, and the ball behaves strangely, falling through textures (-3 points).

Correct Answer: ✅
GPT handles ball interactions with textures more reliably, avoiding the problem of falling through. However, it has a critical flaw: when the player loses, the game crashes with an error, which makes it impossible to see your result and start the game again (-3 points)

Conclusion

Both models excel in different areas. We threw the trickiest tasks at them, and managed to find some weakspots. Llama 405B Instruct Turbo generally adheres good to prompt requirements and provides more detailed explanations but struggles with minor technical and grammatical issues. GPT-4o shows strong logical reasoning and code handling but is prone to errors that can impact the outcome. Depending on the task, Llama 405B Instruct Turbo might be preferred for its comprehensive approach, while GPT-4o could be better suited for tasks that require reliable interaction handling.

Get API Key

Pricing

The Pricing model is given in AI/ML API tokens. GPT-4o and Llama 3.1 405B have equal input prices with 3x lower output price by Llama.

1k AI/ML Tokens	Llama 3.1 405B	GPT-4o
Input price	$0.0065	$0.065
Output price	$0.0065	$0.0195

Get API Key

Compare for yourself

You've seen these models in action. Now it's your turn to test them for your specific needs. Copy the code below into Google Colab or your preferred coding environment, add your API key, and start experimenting!

import openai
import requests

def main():
    client = OpenAI(
      api_key='<YOUR_API_KEY>',
      base_url="https://api.aimlapi.com",
    )

    # Specify the two models you want to compare
    model1 = 'meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo'
    model2 = 'gpt-4o-mini' 
    selected_models = [model1, model2]

    system_prompt = 'You are an AI assistant that only responds with jokes.'
    user_prompt = 'Why is the sky blue?'
    results= {}
    
    for model in selected_models:
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {'role': 'system', 'content': "be strong"},
                    {'role': 'user', 'content': "who is strong?"}
                ],
            )

            message = response.choices[0].message.content
            results[model] = message
        except Exception as error:
            print(f"Error with model {model}:", error)

    # Compare the results
    print('Comparison of models:\n')
    print(f"{model1}:\n{results.get(model1, 'No response')}")
    print('\n')
    print(f"{model2}:\n{results.get(model2, 'No response')}")

if __name__ == "__main__":
    main()

‍

Conclusion

Overall, GPT-4o and LLama 3.1 405B appear to be equally capable, with LLama 405B being much cheaper on the output, and GPT-4o being much faster. Both performed very well, considering all the prompts were one-shot.

You can check access both Llama 3.1 405B API and gpt-4o API here, or see our full model lineup here - try for yourself, and get a feel for the frontier AI power!

Get API Key