ChatGPT-4o
VS
o1-mini

The new o1-mini has been crushing STEM-related tasks.
Can the time-tested GPT-4o perform as effectively as its successor?

Get API Key

Benchmarks and specs

Specs

This is a comparison of two frontier models, both performing at the top level on technical benchmarks. o1-mini is built specifically to excel at STEM, and on the numbers alone it stands in a class of its own, outclassing the rest in this regard. However, GPT-4o and OpenAI's o1-mini need to be tested head-to-head to really see the difference. This comparison explores that matchup, looking in detail at each model's performance to see which one better suits your needs.

Specification ChatGPT-4o o1-mini
Context Window 128K 128K
Output Tokens 16K 64K
Number of parameters in the LLM unknown unknown
Knowledge cutoff October 2023 October 2023
Release Date August 6, 2024 September 12, 2024
Tokens per second ~103 ~74

The key differences between o1-mini and GPT-4o lie in their maximum output tokens and processing speed. o1-mini supports a significantly larger output of 64K tokens, while GPT-4o is limited to 16K. GPT-4o, however, excels in speed, generating roughly 103 tokens per second compared to o1-mini's ~74. Both models share the same input context window and knowledge cutoff, and neither has a disclosed parameter count. Reportedly, the newer model also performs much better in Arabic and Korean than its predecessor. The real distinction comes from testing these models to see how they perform in practice.
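Those throughput figures translate directly into wall-clock time for long answers. A quick sketch, using the approximate tokens-per-second numbers from the table above:

```python
# Back-of-the-envelope latency estimate from the quoted throughput figures.
SPEEDS = {"gpt-4o": 103, "o1-mini": 74}  # approximate tokens per second

def generation_time(model: str, output_tokens: int) -> float:
    """Estimated seconds to generate `output_tokens` at the model's measured rate."""
    return output_tokens / SPEEDS[model]

for model in SPEEDS:
    print(f"{model}: ~{generation_time(model, 16_000):.0f}s for a 16K-token answer")
```

At GPT-4o's 16K-token ceiling, the speed gap already adds roughly a minute of extra wait for o1-mini; at o1-mini's full 64K output it compounds further.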

Benchmarks

This benchmark is the combination of official release notes for both models, as well as multiple open benchmarks.

Benchmark ChatGPT-4o o1-mini
Undergraduate-level knowledge (MMLU) 88.7 85.2
Graduate-level reasoning (GPQA) 53.6 60.0
Code (HumanEval) 90.2 92.4
Competitive programming (Codeforces Elo) 900 1650
Cybersecurity (CTFs) 20.0 28.7
Math problem-solving (MATH) 70.2 90.0

GPT-4o holds its ground in undergraduate-level knowledge, but o1-mini captures the rest of the spectrum, leading in coding, math, and other technical tasks.


Practical tests

We put each model through difficult, real-world scenarios, using logic, language-comprehension, math, and coding questions to assess their abilities.

Some outputs are edited down for brevity and clarity.

Prompt 1. Logical questions:

Alice has N sisters and M brothers. 
How many sisters does Andrew, the brother of Alice, have?

ChatGPT-4o output

o1-mini output

Alice has N sisters and M brothers. 

Since Andrew is one of Alice's 
brothers, he would have the same 
number of sisters and brothers 
as Alice, 
except he would count himself 
among the brothers.

#Answer
So, Andrew has N sisters.
To determine how many sisters Andrew 
has, let's break down the information 
provided:
...
   - The total number of sisters 
     in the family is **N + 1** 
     (Alice plus her N sisters).
...

##Answer
So, Andrew has N + 1 sisters.
Incorrect Answer ❌
A flaw in reasoning leads GPT-4o to a faulty answer: Andrew's sisters include Alice herself, so the count is N + 1, not N.

Correct Answer ✅
o1-mini demonstrates greater reasoning ability and accuracy in logical judgments than GPT-4o.

Prompt 2. Language comprehension:

How many 'r's are there in the word 'strawberry'?

ChatGPT-4o output

o1-mini output

#Answer
The word "strawberry" contains 2 'r's
Assistant: The word "strawberry" 
contains three letter 'r's.

Here's the breakdown:

- S
- T
- R (1st 'r')
- A
- W
- B
- E
- R (2nd 'r')
- R (3rd 'r')
- Y

#Answer
So, there are 3 'r's in "strawberry."
Correct Answer: ❌ (0/5)
The model failed to answer correctly in all five attempts. However, this is easily fixed with slight prompt engineering: adding the phrase "think step-by-step".
Correct Answer: ✅ (4/5)
Correct answers in 4 out of 5 cases.
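For reference, the ground truth for this prompt is a one-liner in any programming language:

```python
# Count occurrences of the letter 'r' in the word the models were asked about.
word = "strawberry"
r_count = word.count("r")
print(r_count)  # 3
```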

Prompt 3. Maths:

Alice and Bob play the following game. 
A stack of n tokens lies before them. 
The players take turns with Alice going first. 
On each turn, the player removes either 1 token or 4 tokens from the stack. 
Whoever removes the last token wins. 

Find the number of positive integers n <= 2024 for which 
there exists a strategy for Bob that guarantees 
that Bob will win the game regardless of Alice's play.

ChatGPT-4o output

o1-mini output

A position is a winning position
for a player if they can force a win 
no matter what the opponent does.

We will analyze a few cases to derive 
a pattern.

We start by analyzing small values of n 
to identify losing positions for Alice.
...
n = 1 + 4k.
Now, 1, 5, 9, 13, ... ≤ 2024 
define these losing positions. 

Solve:
1 + 4k ≤ 2024 ⇒ 4k ≤ 2023 ⇒ k ≤ 505
Therefore, the sequence is k = 0 to 505 
which represents 506 valid terms and thus, 
such values of n.

'Answer'
506
To determine the number of positive
integers n we analyze the game using 
combinatorial game theory.
...
'Winning and Losing Positions:'
   - Winning Position (N-position): 
     A position from which the player 
     can force a win.
     
   - Losing Position (P-position): 
     A position from which every move 
     leads to a winning position 
     for the opponent.

'Identifying P-positions:'
   - By analyzing small values of n, 
     we observe a pattern 
     where the P-positions satisfy 
     n ≡ 0 or n ≡ 2 (mod 5).

Counting P-positions up to 2024
Multiples of 5 (n ≡ 0 (mod 5)): 404.

Numbers congruent to 2 modulo 5 
(n ≡ 2 (mod 5)): 405.


'Answer'
404 + 405 = 809
Correct Answer: ❌ 

Correct Answer: ✅
o1-mini also excels at solving math problems, whereas GPT-4o tends to make reasoning errors.
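The game itself is small enough to verify by brute force. The sketch below computes, for every stack size up to 2024, whether the player to move can force a win, and counts the positions where Bob (the second player) wins:

```python
# A position is winning for the player to move if removing 1 or 4 tokens
# can leave the opponent in a losing position.
N = 2024
win = [False] * (N + 1)   # win[0] = False: the mover has no tokens and loses
for n in range(1, N + 1):
    win[n] = (not win[n - 1]) or (n >= 4 and not win[n - 4])

# Bob wins exactly when the starting position is losing for Alice.
bob_wins = sum(1 for n in range(1, N + 1) if not win[n])
print(bob_wins)  # 809
```

This confirms o1-mini's answer of 809 and its claimed pattern: the losing positions are exactly n ≡ 0 or 2 (mod 5).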

Prompt 4. Let's get coding:

Objective: Create a Tetris game using Python and the Pygame library. 
The game should include classic Tetris gameplay, 
with blocks of different shapes falling from the top of the screen. 

The player can move, rotate, and place the blocks, 
aiming to complete horizontal lines to score points.

Requirements:
Game Board:
Tetrominoes:
Game Logic:
Scoring:
Game Over:
Controls:
UI Elements:

With tasks like this, it is paramount to write very detailed requirements, which you can draft with the help of an AI model itself.

ChatGPT-4o output

o1-mini output

[Screenshots: the Tetris implementations produced by each model]
Correct Answer: ❌ 
The game started, but all we got was a black screen with the grid visible.
Correct Answer: ✅
It's a decent result, but some text and the next-piece preview are not visible.

Prompt 5. Frontend:

Write a slider for images (image1.png, image2.png, image3.png) 
that the user can control using only HTML and CSS

ChatGPT-4o output

o1-mini output

[Screenshots: the image sliders produced by each model]
Correct Answer: ✅
The slider functions properly, but there are a couple of issues: clicking right on the first image skips to the third, and clicking left on the second image doesn't work as expected.
Correct Answer: ❌ 
Surprisingly, the slider doesn't work well: it scrolls through all the pictures at once.


Test takeaways

These tests are the hardest we've thrown at any model, especially the coding part. As you can see, one of the two models is always ready to solve the tasks you provide. We'd recommend ChatGPT-4o for more straightforward tasks, and o1-mini for complex STEM work or tasks requiring long outputs.


Pricing

This time the newer model, o1-mini, is the more expensive one, which isn't easy to achieve when comparing against GPT-4o. Both input and output prices are around 20% higher for o1-mini.

1M Tokens GPT-4o o1-mini
Input price $2.625 $3.15
Output price $10.50 $12.60
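With per-token pricing, the cost of a single request is simple arithmetic. A rough calculator using the per-1M-token prices from the table above:

```python
# Per-1M-token prices from the table above: (input, output) in USD.
PRICES = {
    "gpt-4o":  (2.625, 10.50),
    "o1-mini": (3.15, 12.60),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request."""
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# Example: a 2K-token prompt answered with 8K tokens of output.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 2_000, 8_000):.4f}")
```

Note that o1-mini also spends output tokens on its hidden reasoning, so for the same visible answer the effective gap can be larger than the 20% list-price difference.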

Compare for yourself

You've seen these models in action. Now it's your turn to test them for your specific needs. Copy the code below into Google Colab or your preferred coding environment, add your API key, and start experimenting!

from openai import OpenAI

# Replace with your AI/ML API key
aiml_api_key = "YOUR_API_KEY"

def main():
    client = OpenAI(
        api_key=aiml_api_key,
        base_url="https://api.aimlapi.com",
    )

    # Specify the two models you want to compare
    model1 = 'gpt-4o-2024-08-06'
    model2 = 'o1-mini' 
    selected_models = [model1, model2]

    user_prompt = 'Why is the sky blue?'
    results= {}
    
    for model in selected_models:
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {'role': 'user', 'content': user_prompt}
                ],
                max_tokens=2000,
            )
            message = response.choices[0].message.content
            results[model] = message
        except Exception as error:
            print(f"Error with model {model}:", error)

    # Compare the results
    print('Comparison of models:\n')
    print(f"{model1}:\n{results.get(model1, 'No response')}")
    print('\n')
    print(f"{model2}:\n{results.get(model2, 'No response')}")

if __name__ == "__main__":
    main()

Conclusion

o1-mini proves to be the more capable model for tasks requiring complex reasoning, math problem-solving, and precise coding, and it consistently performs better across benchmarks and practical tests. GPT-4o, however, shows strength in less complex coding tasks, excelling at things like the HTML/CSS slider. The choice between these models depends on the nature of the task: o1-mini for advanced logic and problem-solving, GPT-4o for general knowledge and simpler coding.

You can access both the o1-mini and the latest ChatGPT-4o snapshot via API, or see our full model lineup here. Try them for yourself and get a feel for frontier AI power!


Access both models using our AI API

Explore and use multiple AI functionalities through a single API. Ideal for experimentation and groundbreaking projects.

200+ Models

3x faster response

OpenAI compatible

99.9% Uptime
