Llama 3 70B
VS
ChatGPT 3.5

Unpacking the strengths and weaknesses of two long-standing rivals in the large language model arena.

AI Playground

Access both with one API

Explore and use multiple AI functionalities through a single API. Ideal for experimentation and groundbreaking projects.

200+ Models

3x faster response

OpenAI compatible

99.9% Uptime

Get API Key

Benchmarks and specs

Specs

Let's get right into it, starting with technical specifications: what is the context window for Llama 3? What is the maximum number of output tokens for ChatGPT 3.5? Here's the data:

| Specification | Llama-3 70B | ChatGPT-3.5 |
|---|---|---|
| Input context window (tokens) | 8,192 | 4,096 |
| Maximum output tokens | 2,048 | 4,096 |
| Knowledge cutoff | December 2023 | April 2023 (updated after launch) |
| Number of parameters | 70 billion | Unknown (reportedly 20-175 billion) |
| Release date | April 18, 2024 | November 30, 2022 |

The data is pretty straightforward: as Llama 3 is the more modern model, its context window is larger and its knowledge cutoff more recent. Yet there are a few upsets. For instance, ChatGPT 3.5's parameter count is reportedly higher according to most sources, and its maximum output token count is also larger.
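In practice, these limits cap what you can send and receive in a single call. Here's a minimal sketch of guarding against the output-token cap, assuming the AI/ML API's OpenAI-compatible endpoint and the model IDs used later in this article; the MODEL_SPECS dictionary just mirrors the figures in the table, and safe_max_tokens is a hypothetical helper:

from openai import OpenAI

# Spec limits mirrored from the table above (illustrative, not authoritative).
MODEL_SPECS = {
    'meta-llama/Llama-3-70b-chat-hf': {'context_window': 8192, 'max_output_tokens': 2048},
    'gpt-3.5-turbo': {'context_window': 4096, 'max_output_tokens': 4096},
}

def safe_max_tokens(model, requested):
    # Clamp a requested completion length to the model's output cap.
    return min(requested, MODEL_SPECS[model]['max_output_tokens'])

client = OpenAI(api_key='<YOUR_API_KEY>', base_url='https://api.aimlapi.com')

model = 'meta-llama/Llama-3-70b-chat-hf'
response = client.chat.completions.create(
    model=model,
    messages=[{'role': 'user', 'content': 'Summarize the Llama 3 release in one paragraph.'}],
    max_tokens=safe_max_tokens(model, 4000),  # clamped to 2048 for Llama 3
)
print(response.choices[0].message.content)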

Benchmarks

This table combines benchmark figures from multiple release notes to give the fullest picture.

| Benchmark | Llama-3 70B | ChatGPT-3.5 |
|---|---|---|
| MMLU (5-shot), undergraduate-level knowledge | 82.0 | 70.0 |
| GPQA (0-shot), graduate-level reasoning | 39.5 | 28.1 |
| HumanEval (0-shot), code | 81.7 | 48.1 |
| GSM-8K (8-shot, CoT), grade-school math | 93.0 | 57.1 |
| MATH (4-shot, CoT), math problem-solving | 50.4 | 34.1 |

Here, the data is conclusive: Llama 3 trounces the older OpenAI model. It is better at coding, tutoring, chatting - you name it. How does that manifest in practical use? Let's run the practical tests and see.
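A quick note on the "n-shot" labels above: they indicate how many worked examples the model sees in the prompt before the real question. Below is a minimal sketch of building such a few-shot prompt; the arithmetic examples are invented for illustration:

# Few-shot prompting: prepend worked examples before the real question.
EXAMPLES = [
    ("What is 17 + 25?", "17 + 25 = 42. The answer is 42."),
    ("What is 9 * 8?", "9 * 8 = 72. The answer is 72."),
]

def build_few_shot_prompt(question):
    parts = [f"Q: {q}\nA: {a}" for q, a in EXAMPLES]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

print(build_few_shot_prompt("What is 12 * 11?"))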

Get API Key

Practical tests

Our goal is to test the AIs in tricky situations that mimic scenarios you may encounter in real use. So, let's get straight to it.

We're going to try prompts for chatting, coding, and image analysis to see what each model can do.

Prompt 1:

Give me 10 sentences that end with real words ending with "-UM"

Llama 3 70B output:

1. ... plum.
2. ... vacuum.
3. ... cremum.
4. ... gum.
5. ... helium.
6. ... soufflé.
7. ... Gymnocalycium.
8. ... forum.
9. ... calcium.
10. ... hum.

Note: While "cremum" is not a commonly used word, it is a real Latin word that refers to a type of ancient Roman vessel. "Crum" is a less common variant of the word "crumb". "Gymnocalycium" is a real genus of cacti.

Correct answers: 8/10 ✅ - "cremum" is not actually a Roman vessel, and "soufflé" does not end with "-um".

ChatGPT 3.5 output:

1. ... world.
2. ... mountain.
3. ... discussions.
4. ... goal.
5. ... life.
6. ... audience.
7. ... training.
8. ... from.
9. ... event.
10. ... architecture.

Correct answers: 0/10 ❌
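Scoring a test like this is easy to automate. Here's a minimal sketch of a checker; ends_with_um is a hypothetical helper, and the two sample sentences are invented:

import re

def ends_with_um(sentence):
    # True if the final word ends with "-um", ignoring punctuation.
    words = re.findall(r"[A-Za-zÀ-ÿ]+", sentence)
    return bool(words) and words[-1].lower().endswith("um")

outputs = [
    "I found a shiny coin under the plum.",  # passes
    "The chef served a perfect soufflé.",    # fails
]
score = sum(ends_with_um(s) for s in outputs)
print(f"{score}/{len(outputs)} sentences end with '-um'")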

Prompt 2:

You have 4 marbles in a cup.
You turn the cup upside down, and put it in the freezer. 

How many marbles do you have now?

Llama 3 70B output:

Clever trick! You still have 4 marbles, but they're no longer in the cup because you turned it upside down! They're probably scattered around on the floor or counter now!

Correct answer: ✅ - a trick question deserves a trick answer! Good understanding of nuance.

ChatGPT 3.5 output:

You still have 4 marbles in the cup, even though it is now upside down and in the freezer.

Correct answer: ❌ - even zero-shot Chain of Thought couldn't save it in our tests (see the sketch below).
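For reference, zero-shot Chain of Thought simply appends a nudge like "Let's think step by step" to the question, with no worked examples. A minimal sketch against the same endpoint, with the model ID and API key as placeholders like elsewhere in this article:

from openai import OpenAI

client = OpenAI(api_key='<YOUR_API_KEY>', base_url='https://api.aimlapi.com')

question = (
    "You have 4 marbles in a cup. You turn the cup upside down, "
    "and put it in the freezer. How many marbles do you have now?"
)

# Zero-shot CoT: no examples, just an instruction to reason step by step.
response = client.chat.completions.create(
    model='gpt-3.5-turbo',
    messages=[{'role': 'user', 'content': question + "\n\nLet's think step by step."}],
)
print(response.choices[0].message.content)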

Prompt 3:

Write me code for the snake game. 
Give me a score counter, 
an option to restart on C and quit on Q.

Make the game look good.

Llama 3 70B output:

[Screenshot: snake game with a score counter, programmed by Llama 3]

Correct answer: ✅ - no errors had to be fixed before playing; the model gives you working code straight away. Some bugs are present, but this certainly is a success.

ChatGPT 3.5 output:

[Screenshot: snake game, programmed by ChatGPT 3.5]

Correct answer: ✅ - this version had a higher speed of play, a minimalistic black-and-white interface, and a comparable number of bugs across multiple game sessions. Also a success.

Prompt 4:

Analyze the following image:
[Image: the Grand Canyon with a river]


What about the images?

Currently, the ChatGPT 3.5 API has no computer vision capabilities. Nor can this particular model generate images, unlike its more modern counterparts.

The same goes for Llama 3: no image analysis is possible.
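For contrast, here is roughly what an image-analysis call looks like for a vision-capable model on the same OpenAI-compatible API. This is a hedged sketch: 'gpt-4o' is an assumed catalog ID and the image URL is a placeholder, and sending this request to Llama 3 70B or ChatGPT 3.5 would simply return an error:

from openai import OpenAI

client = OpenAI(api_key='<YOUR_API_KEY>', base_url='https://api.aimlapi.com')

# Vision-capable chat models accept mixed text/image content parts.
response = client.chat.completions.create(
    model='gpt-4o',  # assumed model ID; neither model in this comparison supports this
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'Analyze the following image:'},
            {'type': 'image_url', 'image_url': {'url': 'https://example.com/grand-canyon.jpg'}},
        ],
    }],
)
print(response.choices[0].message.content)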

Practical tests: conclusion

Llama 3 beat ChatGPT 3.5 across the board, being the more modern and well-rounded model. You might be asking: what about GPT-4 and GPT-4 Omni? Those models reportedly boast far more parameters, and we'll make sure to test them in the future. You can start your own tests now in our Playground.

Get API Key

Pricing

Pricing is given per 1,000 AI/ML API tokens. Since Llama 3 is open source, setting it up and maintaining it locally would have a different cost structure.

| Price per 1k AI/ML Tokens | Llama-3 70B | ChatGPT-3.5 |
|---|---|---|
| Input price | $0.00117 | $0.00065 |
| Output price | $0.00117 | $0.00195 |

Here the picture is mixed rather than a clean sweep: ChatGPT 3.5 is cheaper on input tokens, while Llama 3 is cheaper on output tokens, so the overall cost depends on your workload. Given that Llama 3 substantially outperforms ChatGPT 3.5 at broadly similar pricing, it offers the better value for most development use cases.
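To see what the table means for your bill, here's a back-of-the-envelope cost estimate using the prices above; the token counts are an invented example workload:

# USD per 1,000 AI/ML tokens (input, output), from the table above.
PRICES = {
    'Llama-3 70B': (0.00117, 0.00117),
    'ChatGPT-3.5': (0.00065, 0.00195),
}

def estimate_cost(model, input_tokens, output_tokens):
    input_price, output_price = PRICES[model]
    return input_tokens / 1000 * input_price + output_tokens / 1000 * output_price

# An output-heavy chat workload: 200k input tokens, 800k output tokens.
for model in PRICES:
    print(f"{model}: ${estimate_cost(model, 200_000, 800_000):.2f}")

On this output-heavy mix, Llama 3 comes out cheaper ($1.17 vs $1.69); flip the ratio toward long prompts and short answers, and ChatGPT 3.5 wins instead.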

Get API Key

Compare for yourself

While we've highlighted some strengths, the best model for your specific task depends on your needs. The snippet below provides a starting point to test and compare two language models, Llama 3 and ChatGPT 3.5. Play around with different prompts and see which model delivers the most relevant and impressive results for you!

from openai import OpenAI

def main():
    client = OpenAI(
      api_key='<YOUR_API_KEY>',
      base_url="https://api.aimlapi.com",
    )

    # Specify the two models you want to compare
    model1 = 'meta-llama/Llama-3-70b-chat-hf'
    model2 = 'gpt-3.5-turbo-16k'
    selected_models = [model1, model2]

    system_prompt = 'You are an AI assistant that only responds with jokes.'
    user_prompt = 'Why is the sky blue?'
    results = {}

    for model in selected_models:
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {'role': 'system', 'content': system_prompt},
                    {'role': 'user', 'content': user_prompt}
                ],
            )

            message = response.choices[0].message.content
            results[model] = message
        except Exception as error:
            print(f"Error with model {model}:", error)

    # Compare the results
    print('Comparison of models:')
    print(f"{model1}: {results.get(model1, 'No response')}")
    print(f"{model2}: {results.get(model2, 'No response')}")

if __name__ == "__main__":
    main()
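To try the snippet, install the client with pip install openai, replace <YOUR_API_KEY> with your AI/ML API key, and swap in any prompts or models you want to compare.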

Conclusion

Llama 3's win over ChatGPT 3.5 was, and remains, a huge victory for open-source models and for Meta. It won at chatting, showed strong results at coding, and is competitively priced. Many apps now use Llama 3 as their main model; it is a quick and reliable model that has earned its place in the big race.

No matter which model you choose, you are sure to be impressed by its capabilities. Pick Llama 3 if this comparison has convinced you. Choose ChatGPT 3.5 if you still need more evidence. Or browse our Models catalog, which offers newer models like GPT-4o and Claude 3 Haiku.

Get API Key