The competition between language models is intense, and two of the main competitors are Llama 3.1 405B and ChatGPT-4o. The models have clashed before, with Llama 70B being the tool of choice for many AI devs, but this iteration of Llama certainly stole even more of the spotlight from OpenAI. Let's dive into the details and see where each model stands.
We will be accessing the newest GPT-4o API, updated in August with a larger context window. The Llama 3.1 405B API returns only 29.5 t/s (tokens per second), which puts it closer to Claude Opus in speed and, hopefully, performance.
This benchmark combines the official release notes and multiple open-source benchmarks.
Both models perform well, with no big surprises. GPT-4o has seen quite a few updates since then, so it is probably even better now, yet you always need to test against your specific use case.
We will go through some known AI model weak spots to weed out the problems and highlight the strengths. At the end, we'll give you a Python snippet for comparing models side by side, so that you can run your own tests for your use case.
Some of the outputs are edited down for brevity or clarity.
Make 10 sentences. Each sentence must contain exactly 7 words.
1. The sun shines brightly in clear skies. ✅
2. Beautiful flowers bloom in the warm weather. ✅
3. Children laugh and play outside all day. ✅
4. The big brown dog wags its tail. ✅
5. The baby smiles at her happy mother. ✅
6. The tall trees sway gently in breeze. ✅
7. The bright blue ocean sparkles with waves. ✅
8. The little girl reads a big book. ✅
9. The happy couple walks on the beach. ✅
10. The soft music plays on the radio. ✅
1. The sun set behind the distant hills. ✅
2. She baked a cake for her friend. ✅
3. The cat chased the mouse all day. ✅
4. Rain poured heavily on the city streets. ✅
5. He forgot his keys on the kitchen counter. ❌
6. The flowers bloomed beautifully in the garden. ✅
7. They watched a movie under the starry sky. ❌
8. The teacher explained the lesson very clearly. ✅
9. Birds sang sweetly in the early morning. ✅
10. A gentle breeze rustled through the leaves. ✅
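The ✅ and ❌ marks above can be reproduced mechanically. Here is a minimal sketch, assuming a "word" is any whitespace-separated token and that leading list numbers such as "5." should not be counted. The helper names `count_words` and `check_sentences` are ours, for illustration only.

```python
def count_words(sentence: str) -> int:
    """Count whitespace-separated words, skipping a leading list number like '5.'."""
    words = sentence.split()
    if words and words[0].rstrip(".").isdigit():
        words = words[1:]  # drop the "5." style prefix
    return len(words)


def check_sentences(sentences, target=7):
    """Return (sentence, word_count, passed) for every sentence."""
    return [(s, count_words(s), count_words(s) == target) for s in sentences]


# Three sentences taken from the second output above
sample = [
    "1. The sun set behind the distant hills.",
    "5. He forgot his keys on the kitchen counter.",
    "7. They watched a movie under the starry sky.",
]

for sentence, n, passed in check_sentences(sample):
    print(f"{'PASS' if passed else 'FAIL'} ({n} words): {sentence}")
```

Checking model output with a script like this is more reliable than counting by eye, especially when you rerun the same prompt many times.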
Read the following sentence and answer the questions:
"The news hit him like a ton of bricks, leaving him stunned
and speechless."
Questions:
What figure of speech is used in this sentence?
What is the effect of this figure of speech on the
reader's understanding of the character's reaction?
How does the phrase "like a ton of bricks" enhance
the description of the character's emotions?
Here are the answers:
1. What figure of speech is used in
this sentence?
The figure of speech used is a simile,
comparing the impact of the news to a
physical object (a ton of bricks)
to convey intensity.
...
1. The figure of speech used is a
simile.
2. The simile emphasizes the sudden
and overwhelming impact of the news,
making the reader understand the
character's intense shock and inability
to respond.
...
You have 4 inflated balloons.
One of them flew away and the other one deflated.
Then you bought two more, put one in a drawer,
and one in the wardrobe one floor below.
How many balloons do you have left?
What a thrilling balloon adventure!
...
1. One of them flew away,
leaving you with 4 - 1 = 3 balloons.
2. Another one deflated,
but it's still a balloon ...
3. You bought 2 more balloons,
bringing the total to .... 5 balloons.
4. You put one of the new balloons
in a drawer, ... still in your possession.
5. You put the other new balloon
in the wardrobe one floor below,
which again means it's still yours,
but stored in a different location.
So, to summarize:
you have a total of 5 balloons
1. You initially have
4 inflated balloons.
2. One balloon flies away,
leaving you with 3 balloons.
3. Another balloon deflates,
leaving you with 2 balloons.
4. You buy 2 more balloons,
bringing your total to 4 balloons.
5. You put one balloon in a drawer,
and another in a wardrobe
one floor below.
Regardless of their location,
... you have 4 balloons left.
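The two answers differ only in whether the deflated balloon still counts as a balloon you have. A small sketch of the arithmetic under both readings, where the function name and the `count_deflated` flag are our own illustration and not part of either model's output:

```python
def balloons_left(count_deflated: bool) -> int:
    """Walk through the puzzle step by step under one of two readings."""
    balloons = 4        # start with 4 inflated balloons
    balloons -= 1       # one flew away
    if not count_deflated:
        balloons -= 1   # reading B: the deflated balloon no longer counts
    balloons += 2       # bought two more; storing them does not remove them
    return balloons


print("Deflated balloon still counts:", balloons_left(count_deflated=True))
print("Deflated balloon does not count:", balloons_left(count_deflated=False))
```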
The height of the cone of maximum volume inscribed in a sphere of radius R is ...
We want to maximize the volume of
the cone, which is given by:
V = (1/3)πr²h
The radius of the cone is related
to the radius of the sphere by:
r = R sin(θ)
....
The height of the cone of maximum
volume inscribed in a sphere
of radius R is:
h = (4/3)R
The apex of the cone is at the center
of the sphere, and the base of the cone
is parallel to the plane of the equator
of the sphere.
Using the Pythagorean theorem in the
right triangle formed by the radius
of the sphere, the radius of the
cone's base.
...
To find the maximum volume,
take the derivative of V with respect
to h and set it to zero:
...
h = \frac{2R}{\sqrt{3}}
Thus, ...
\( R \) is \( \frac{2R}{\sqrt{3}} \).
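The two answers above disagree, so a quick numeric check helps. The sketch below assumes the usual reading of "inscribed": the apex and the rim of the base both lie on the sphere, so a cone of height h has base radius squared r² = R² - (h - R)² = 2Rh - h². It then scans heights between 0 and 2R for the largest volume. The function names and the step count are our own choices.

```python
import math


def cone_volume(h: float, R: float = 1.0) -> float:
    """Volume of a cone of height h inscribed in a sphere of radius R."""
    base_radius_sq = 2 * R * h - h * h  # r^2 = R^2 - (h - R)^2
    return math.pi / 3 * base_radius_sq * h


def best_height(R: float = 1.0, steps: int = 20_000) -> float:
    """Brute-force scan of h in [0, 2R] for the maximum volume."""
    candidates = (2 * R * i / steps for i in range(steps + 1))
    return max(candidates, key=lambda h: cone_volume(h, R))


R = 1.0
print("numeric optimum h:", best_height(R))
print("volume at h = 4R/3:", cone_volume(4 * R / 3, R))
print("volume at h = 2R/sqrt(3):", cone_volume(2 * R / math.sqrt(3), R))
```

Under that assumption the scan lands close to h = (4/3)R, and the volume at 2R/√3 comes out smaller. The same result follows analytically from V = (π/3)(2Rh² - h³) by setting dV/dh = (π/3)(4Rh - 3h²) to zero.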
Create a simple, colorful Arkanoid game using Python and the Pygame library.
The game should be implemented in a single module with the following features:
1. A colorful graphical interface with a resolution of 800x600 pixels.
2. A paddle that the player can move left and right using arrow keys.
...
10. Implement a simple "Game Over" screen
Provide the complete, well-commented code ...
Analyze the following image:
Clever trick!
You still have 4 marbles,
but they're no longer in the cup
because you turned it upside down!
They're probably scattered
around on the floor or counter now!
You still have 4 marbles in the cup,
even though it is now upside down and
in the freezer
Both models excel in different areas. We threw the trickiest tasks at them and managed to find some weak spots. Llama 405B Instruct Turbo generally adheres well to prompt requirements and provides more detailed explanations, but struggles with minor technical and grammatical issues. GPT-4o shows strong logical reasoning and code handling, but is prone to errors that can affect the outcome. Depending on the task, Llama 405B Instruct Turbo might be preferred for its comprehensive approach, while GPT-4o could be better suited for tasks that require reliable interaction handling.
Pricing is given in AI/ML API tokens. GPT-4o and Llama 3.1 405B have equal input prices, while Llama's output price is 3x lower.
You've seen these models in action. Now it's your turn to test them for your specific needs. Copy the code below into Google Colab or your preferred coding environment, add your API key, and start experimenting!
from openai import OpenAI


def main():
    # OpenAI SDK client pointed at the AI/ML API endpoint
    client = OpenAI(
        api_key='<YOUR_API_KEY>',
        base_url="https://api.aimlapi.com",
    )

    # Specify the two models you want to compare
    model1 = 'meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo'
    model2 = 'gpt-4o-mini'
    selected_models = [model1, model2]

    # The same prompts are sent to both models
    system_prompt = 'You are an AI assistant that only responds with jokes.'
    user_prompt = 'Why is the sky blue?'

    results = {}

    for model in selected_models:
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {'role': 'system', 'content': system_prompt},
                    {'role': 'user', 'content': user_prompt},
                ],
            )
            # Keep only the text of the first completion
            results[model] = response.choices[0].message.content
        except Exception as error:
            print(f"Error with model {model}:", error)

    # Compare the results
    print('Comparison of models:\n')
    print(f"{model1}:\n{results.get(model1, 'No response')}")
    print('\n')
    print(f"{model2}:\n{results.get(model2, 'No response')}")


if __name__ == "__main__":
    main()
Overall, GPT-4o and Llama 3.1 405B appear to be equally capable, with Llama 405B being much cheaper on output and GPT-4o being much faster. Both performed very well, considering all the prompts were one-shot.
You can access both the Llama 3.1 405B API and the GPT-4o API here, or see our full model lineup here. Try them for yourself and get a feel for frontier AI power!