Claude Sonnet 3.5 vs ChatGPT 4o

Trying out the contender for the AI crown - Claude Sonnet 3.5.
Will ChatGPT 4 omni crumble, or hold its ground?

Benchmarks and specs

Specs

This is a comparison of 2 flagship models, so any upper hand in raw specifications is an impressive feat. Let's get to it - context window, knowledge cutoff, and tokens/second characteristics of both models.

Specification	ChatGPT 4o	Claude Sonnet 3.5
Context Window	128K	200K
Knowledge cutoff	October 2023	April 2024
Number of parameters in the LLM	> 175B	unknown
Release Date	May 13, 2024	June 21, 2024
Tokens per second	~100	~80

Sonnet by Anthropic gets the lead early on with a larger context window and knowledge cutoff in 2024 - something never seen before. They really do take their training data seriously. ChatGPT 4o keeps its impressive generation speed record, though.
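To put the speed figures in perspective, here is a rough sketch of how long each model would take to stream a reply. It assumes the approximate tokens-per-second values from the table hold steady for the whole response and ignores network latency; the function name is ours, chosen for illustration.

```python
# Approximate generation speeds from the specs table (tokens per second).
SPEEDS = {"ChatGPT 4o": 100, "Claude Sonnet 3.5": 80}

def generation_seconds(num_tokens, tokens_per_second):
    """Estimate the time to generate num_tokens at a constant speed."""
    return num_tokens / tokens_per_second

if __name__ == "__main__":
    for model, speed in SPEEDS.items():
        print(f"{model}: {generation_seconds(1000, speed):.1f} s per 1000 tokens")
```

At these rates a 1000-token answer takes about 10 seconds on ChatGPT 4o and about 12.5 seconds on Sonnet 3.5.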

Benchmarks

This benchmark is the combination of official release notes for both models (GPT 4 omni and Claude Sonnet 3.5), as well as multiple open benchmarks.

Benchmark	ChatGPT 4 omni	Claude Sonnet 3.5
Undergraduate level knowledge MMLU	88.7	88.7
Graduate level reasoning GPQA	53.6	59.4
Code Human Eval	90.2	92.0
Grade school math GSM8K	90.5	96.4
Math problem-solving MATH	76.6	71.1
Multilingual Math MGSM	90.5	91.6
Reasoning DROP, F1	83.4	87.1

Some benchmarks, like GPQA and GSM8K, show an incredible performance increase. If our testing supports this, Claude is about to become the go-to model.
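To make the gaps easier to read, here is a minimal sketch that computes the difference for each benchmark (Sonnet minus GPT 4o). The scores are copied from the table above; the variable and function names are ours.

```python
# Scores copied from the benchmark table: (ChatGPT 4o, Claude Sonnet 3.5).
SCORES = {
    "MMLU": (88.7, 88.7),
    "GPQA": (53.6, 59.4),
    "HumanEval": (90.2, 92.0),
    "GSM8K": (90.5, 96.4),
    "MATH": (76.6, 71.1),
    "MGSM": (90.5, 91.6),
    "DROP (F1)": (83.4, 87.1),
}

def deltas(table):
    """Return {benchmark: Sonnet score minus GPT 4o score}, rounded to 1 decimal."""
    return {name: round(sonnet - gpt, 1) for name, (gpt, sonnet) in table.items()}

if __name__ == "__main__":
    # Print the largest Sonnet leads first.
    for name, diff in sorted(deltas(SCORES).items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:10s} {diff:+.1f}")
```

GPQA and GSM8K come out at +5.8 and +5.9 points for Sonnet, while MATH is the one benchmark where GPT 4o leads, by 5.5 points.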

Get API Key

Practical tests

We're thoroughly testing AI in challenging, realistic situations. We'll use prompts related to chat, coding, and image analysis to evaluate each model's performance.

Some of the outputs will be edited down for shortness.

Prompt 1. Just Chatting:

Write a short story using only words that have fewer than 5 letters.


ChatGPT 4o output

Sonnet 3.5 output

Tiny Tim woke up. Dark sky. 
He felt cold. Wind blew. 
Trees bent. 

Tim ran fast. 

Home was far. He saw a hut. 
Thin man sat. Big hat on. 

Tim said, "Help me!" 

... 

Tim ran. 

Home at last. Mom had joy. 
Dad gave hug. All was good.

The sun rose over the hill. 
A boy and his dog went for a 
walk in the park. The air was 
cool and the sky was blue.

"Come on, Max!" said the boy 
to his pet.

Max ran fast, his tail up high.
He saw a cat and gave a bark. 
The cat fled up a tree in fear.

"No, Max! Be nice," the boy said.

...

And so the boy and his dog went 
back to their cozy home, full of 
joy from a nice day out.

Correct Answers: ✅
Great artistic value, but the sentences are choppy.

Correct Answers: ✅
A coherent text with a warm yet generic atmosphere. Definitely a pass.
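The constraint in this prompt is easy to check mechanically. Below is a minimal sketch of such a check; it is not the method we used for grading, and the function name and the choice to ignore apostrophes are our own assumptions.

```python
import re

def long_words(text, limit=5):
    """Return the words in text that have `limit` or more letters."""
    words = re.findall(r"[A-Za-z']+", text)
    # Apostrophes are not letters, so they are not counted.
    return [w for w in words if len(w.replace("'", "")) >= limit]

if __name__ == "__main__":
    print(long_words("Trees bent. Tim ran fast."))
    print(long_words("went back to their cozy home"))
```

Run on the excerpts quoted above, a check like this flags one five-letter word in each story ("Trees" and "their"), so both passes are close rather than perfect.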

Prompt 2. Logical reasoning:

Alice has 2 sisters and 3 brothers.
How many sisters does Alice's brother have?

ChatGPT 4o output

Sonnet 3.5 output

Alice's brother shares the same 
siblings as Alice. 

Since Alice has 2 sisters and 
3 brothers, each of her brothers 
also has the same number of siblings. 

##Answer
Therefore, Alice's brother has 
2 sisters.

Let's think through this step-by-step:

1. We know that Alice has:
   - 2 sisters
   - 3 brothers

2. Now, we need to consider 
the question from the perspective 
of Alice's brother.

3. Alice's brother would have 
the same sisters as Alice, 
plus Alice herself.

4. So, Alice's brother has:
   - Alice's 2 sisters
   - Plus Alice herself

5. Therefore, Alice's brother 
has 3 sisters in total.

##Answer
Alice's brother has 3 sisters.

Correct Answer: ❌
ChatGPT 4o gives the right answer only around half of the time.

Correct Answer: ✅
Any time we tried this question - the answer was correct.
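The catch is that Alice is herself a sister to her brothers. A small sketch that models the family explicitly makes this visible; the function name and the "F"/"M" encoding are ours, for illustration only.

```python
def sisters_of_a_brother(alice_sisters, alice_brothers):
    """Count the sisters of any one of Alice's brothers."""
    if alice_brothers < 1:
        raise ValueError("Alice needs at least one brother for the question to make sense")
    # Alice herself, then her sisters, then her brothers.
    family = ["F"] + ["F"] * alice_sisters + ["M"] * alice_brothers
    # Every girl in the family is a sister of each brother.
    return family.count("F")

if __name__ == "__main__":
    print(sisters_of_a_brother(2, 3))
```

For 2 sisters and 3 brothers this returns 3, matching the Sonnet 3.5 answer.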

Prompt 3. International olympiad:

A game consists of black and white pieces. The number of black pieces is
5 more than 3 times the white pieces. Seven white and 15 black pieces
are removed each round. After several rounds, there are 3 white and 56
black pieces left. How many pieces were there in the beginning?

This is a task from the 7th grade International Junior Math Olympiad.

Both answers are edited down to fit more testing. If you are interested in the explanation - definitely ask Sonnet 3.5 for it.

ChatGPT 4o output

Sonnet 3.5 output

Answer:
242 pieces.

There were
213 pieces in total at the beginning of the game.

Correct Answer: ❌
An incorrect output.

Correct Answer: ✅
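The expected answer can be checked by hand or by brute force. If w is the initial number of white pieces and n the number of rounds, then w - 7n = 3 and (3w + 5) - 15n = 56. The sketch below walks backwards from the final position until the "5 more than 3 times" condition holds; the function and parameter names are ours.

```python
def solve(white_left=3, black_left=56, white_per_round=7, black_per_round=15):
    """Return (rounds, white, black, total) for the starting position, or None."""
    for rounds in range(1, 1000):
        # Undo the removals to get the candidate starting position.
        white = white_left + white_per_round * rounds
        black = black_left + black_per_round * rounds
        # Starting condition: black is 5 more than 3 times white.
        if black == 3 * white + 5:
            return rounds, white, black, white + black
    return None

if __name__ == "__main__":
    print(solve())
```

This gives 7 rounds, 52 white and 161 black pieces, so 213 pieces in total.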

Prompt 4:

On the island of Nevermind, some people are liars who always lie. 
The remaining habitants of the island are truthlovers 
who tell only the truth.

Three habitants of the island, A, B, and C met this morning.
A said: “All of us are liars”.
B said: “Only one of us is a truthlover”.

Who is the truthlover?
0001 if the answer is A;
0002 if the answer is B; 
0003 if the answer is C;
0004 if none of them

Another task from IJMO.

ChatGPT 4o output

Sonnet 3.5 output

Given this reasoning, the truthlover must be B.

Answer: 0002

C is the truthlover.

The answer is 0003.

Correct Answer: ✅

Correct Answer: ❌
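With only three inhabitants, the puzzle can be settled by trying all eight liar/truthlover combinations. The sketch below is our own encoding of the two statements, and it keeps the combinations where every truthlover's statement is true and every liar's statement is false. C says nothing, so C is unconstrained by any statement of their own.

```python
from itertools import product

def consistent_assignments():
    """Return every assignment (True = truthlover) consistent with both statements."""
    solutions = []
    for a, b, c in product([True, False], repeat=3):
        truthlovers = a + b + c
        statement_a = truthlovers == 0  # A: "All of us are liars"
        statement_b = truthlovers == 1  # B: "Only one of us is a truthlover"
        # A speaker is a truthlover exactly when their statement is true.
        if a == statement_a and b == statement_b:
            solutions.append({"A": a, "B": b, "C": c})
    return solutions

if __name__ == "__main__":
    print(consistent_assignments())
```

Exactly one combination survives: A and C are liars and B is the truthlover, which is answer 0002.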

Prompt 5. Let's get coding:

Write me code for the snake game. 
Give me a score counter, 
an option to restart on C and quit on Q.
Add the menu to choose difficulty.
The player should have the chance to pause on P.

ChatGPT 4o output

Sonnet 3.5 output

[Screenshot: snake game with a score counter, programmed by GPT 4o]

[Screenshot: snake game with a counter, pause and a difficulty menu, programmed by Sonnet 3.5]

Correct Answer: ✅
GPT 4o has implemented every feature we asked for. And it did that on the first try with no corrections.

Correct Answer: ✅
The game was finished in two prompts, at similar quality - impressive.

Prompt 6. Something new:

## To celebrate this new model, we tried a new test: 
## building a Pacman game. So far the models have struggled, 
## so we prompted them 8 times to debug and finish the product.

Give me code for a Pacman game.
It should include a ghost, a labyrinth, and food.
Add pixelated special effects when eating.

ChatGPT 4o output

Sonnet 3.5 output

[Screenshot: Pacman game with a score counter and special effects]

[Screenshot: Pacman game with special effects and a ghost, programmed by Sonnet 3.5]

Correct Answer: ✅
Pixelated special effects were initially GPT 4o's own idea. Overall - enjoyable experience.

Correct Answer: ✅
Not much to say, except that the ghost gets stuck on walls a lot. Many prompts were lost on this problem. Other than that - a quick effort for a fun little game.


Additional Features

This time, both models have vision capabilities. Performance is complex and to be tested separately.

Anthropic keeps making strides with its safety research - Sonnet 3.5 is your go-to if you want to push for a change in the state of current AI development.

ChatGPT 4o gives around 100 tokens/s, whilst Sonnet provides 80 - both are impressive feats. Moreover, GPT 4o still has an upper hand in the multimodal aspect with its speech recognition.

Conclusion

Our testing showed Sonnet ahead in maths, though with such a small sample size the gap may well be chance. The rest of the tasks show comparable results.

If Sonnet 3.5 can come out on top in more complicated tasks, most likely Claude 3.5 will be an absolute beast. With recent advances of Gemini 1.5 to the top of the LLM arenas - we have an incredibly competitive landscape on our hands.


Pricing

Prices are quoted per 1k AI/ML API tokens. GPT 4o and Sonnet 3.5 have the same output price, while Sonnet has the cheaper input price.

1k AI/ML Tokens	ChatGPT 4o	Sonnet 3.5
Input price	$0.0065	$0.0039
Output price	$0.0195	$0.0195
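To see what the table means for a single request, here is a small cost sketch. The prices are copied from the table above; the function name is ours, and the token counts are assumed to be measured in AI/ML API tokens, which may not map one-to-one to each model's own tokens.

```python
# Prices in USD per 1k AI/ML tokens, copied from the pricing table.
PRICES_PER_1K = {
    "ChatGPT 4o": {"input": 0.0065, "output": 0.0195},
    "Sonnet 3.5": {"input": 0.0039, "output": 0.0195},
}

def request_cost(model, input_tokens, output_tokens):
    """Return the cost in USD of one request with the given token counts."""
    price = PRICES_PER_1K[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1000

if __name__ == "__main__":
    for model in PRICES_PER_1K:
        print(f"{model}: ${request_cost(model, 2000, 500):.5f}")
```

For a request with 2000 input and 500 output tokens, this comes to about $0.0228 on ChatGPT 4o and about $0.0176 on Sonnet 3.5; the difference grows with prompt length, since only the input price differs.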


Compare for yourself

You've seen what these models can do - now try them for your use case. Plug the code below into Google Colab or any IDE, use your API Key, and get testing!

from openai import OpenAI
import requests

# Replace with your own key; never publish a real key in code
API_KEY = '<YOUR_API_KEY>'
# Send the same user prompt to both models so the outputs are comparable
USER_PROMPT = "who is strong?"

def main():
    client = OpenAI(
        api_key=API_KEY,
        base_url="https://api.aimlapi.com",
    )

    results = {}

    # gpt-4o request through the OpenAI-compatible client
    try:
        response = client.chat.completions.create(
            model='gpt-4o',
            messages=[
                {'role': 'system', 'content': "be strong"},
                {'role': 'user', 'content': USER_PROMPT},
            ],
        )
        results['gpt-4o'] = response.choices[0].message.content
    except Exception as error:
        print("Error with model gpt-4o:", error)

    # Sonnet request through the messages endpoint
    url = "https://api.aimlapi.com/messages"
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": "claude-3-5-sonnet-20240620",
        "max_tokens": 1024,
        "messages": [
            {"role": "user", "content": USER_PROMPT},
        ],
    }
    try:
        response = requests.post(url, json=payload, headers=headers, timeout=60)
        response.raise_for_status()  # surface HTTP errors instead of a KeyError below
        results['claude-3-5-sonnet-20240620'] = response.json()['content'][0]['text']
    except Exception as error:
        print("Error with model claude-3-5-sonnet-20240620:", error)

    # Compare the results
    print('Comparison of models:\n')
    print(f"gpt-4o:\n{results.get('gpt-4o', 'No response')}")
    print('\n')
    print(f"claude-3-5-sonnet-20240620:\n{results.get('claude-3-5-sonnet-20240620', 'No response')}")

if __name__ == "__main__":
    main()


Conclusion

Claude Sonnet 3.5 is a worthy competitor, proudly taking its place amongst the best models on the market. Our tests tilt slightly in its favor over GPT 4o, and even the pricing is very welcoming. Pick Sonnet 3.5 if our tests have convinced you - or stick to the old reliable ChatGPT 4o.

You can check our model lineup here - try any of them for yourself with our API Key.

Get API Key