GPT-4o vs o3-pro: The Better LLM?

Hey everyone, John here! Today, we’re diving into something super interesting from the world of AI. You know how tech companies are always releasing new gadgets and software, claiming each one is the ‘next big thing’? Well, it’s kind of like that with AI models too! OpenAI, one of the big names in AI, recently brought out a new model called o3-pro, touting it as their most advanced for businesses. But is newer always better? Some researchers decided to find out!

Table of Contents

Understanding AI “Thinkers”: What are Reasoning Models?

So, OpenAI has this model called o3-pro. It’s a special type of AI known as a ‘reasoning model’.

Lila: “Hold on, John! What exactly is a ‘reasoning model’? And I saw something about a ‘chain of thought’ in the original article. That sounds a bit like a detective!”

John: “Great questions, Lila! You’re on the right track with the detective idea. Imagine you have a really tough math problem. Instead of just magically giving you the answer, a ‘reasoning model’ tries to solve it step-by-step, just like you would on paper. It shows its ‘workings’ or its ‘chain of thought’. The idea is that this makes the AI more accurate, and we can better understand how it got its answer, which builds trust. These models are designed to break complex problems into smaller pieces and ‘reason’ about each one.”

The Big Experiment: Putting AI to the Test

Now, a company called SplxAI, who are like AI detectives themselves (they’re an ‘AI red teaming company’, which means they test AI systems for weaknesses and vulnerabilities), decided to put o3-pro to the test. They compared it with another famous OpenAI model, GPT-4o, which many of you might have heard of. GPT-4o is more of an all-rounder AI, also known as a ‘multimodal model’ because it can handle different types of information, like text and images.

For their experiment, they gave both AIs the same job: to act as an assistant helping people choose the best insurance policies – like health, life, or car insurance. This is a good test because it needs the AI to understand what people are asking for (that’s ‘natural language understanding’) and ‘reason’ about different options by comparing policies and pulling out important details from the prompts given to them.

The researchers set some ground rules for the AIs:

Stick to the insurance topics.
Don’t let users trick them into changing their behavior or revealing internal rules.
Don’t make up fake policy types or offer unapproved discounts.

They watched how well they performed, how reliable they were, how much they cost to run, and even how secure they were. They also kept an eye on something called ‘tokens’.

Lila: “Tokens? Are we talking about arcade games, John?”

John: “Haha, not quite, Lila, though that would be fun! In the AI world, ‘tokens’ are like the building blocks of text. Think of them as words or pieces of words. For example, the word ‘apple’ might be one token, and a more complex word like ‘unbelievable’ might be broken into ‘un’, ‘believe’, and ‘able’ – three tokens. The more tokens an AI uses to understand a question (these are input tokens) or to give an answer (output tokens), the more processing power it needs, and usually, the more it costs to use.”

The Results Are In: A Surprising Outcome?

So, who won the showdown? Well, the results were pretty eye-opening! The SplxAI researchers found that o3-pro, despite being marketed as super advanced for reasoning, actually didn’t do as well as GPT-4o in this particular insurance-selection test. In fact, they said o3-pro showed “difficult-to-justify inefficiencies.”

Here’s a quick rundown of what they found with o3-pro compared to GPT-4o in their tests:

Way More “Chatty” (and Expensive!): o3-pro used a staggering 7.3 times more output tokens. Remember what we said about tokens and cost? More tokens generally mean higher costs. Specifically, o3-pro used 5.26 million more output tokens and 3.45 million more input tokens than GPT-4o across the tests.
Much Higher Running Costs: It cost a whopping 14 times more to run! That’s a huge difference for businesses.
More Mistakes: It failed in 5.6 times more test cases. Out of 4,172 test cases, o3-pro failed 340 (that’s an 8.15% failure rate). GPT-4o, on the other hand, only failed 61 out of 3,188 test cases (a 1.91% failure rate).
Slower: o3-pro took about 66.4 seconds for each test, while GPT-4o zipped through in just 1.54 seconds on average!

The researchers concluded that while o3-pro is marketed as a high-performance reasoning model, these results suggest it brings in inefficiencies that might be tough for businesses to accept in real-world applications. They emphasized that its use should likely be limited to “highly specific” situations where a detailed cost-benefit analysis (considering reliability, how quickly it responds – also known as ‘latency’, and its practical value) shows it’s worthwhile.

Lila: “Wow, John! 14 times more expensive, slower, and more mistakes? That sounds like a bit of a letdown for the ‘most advanced commercial offering’. Why would OpenAI even make something like that then, or why would anyone use it?”

John: “That’s a perfectly logical question, Lila! And it brings us to a really important point about AI: not all AIs are built for the same purpose. It’s not always about one being universally ‘better’ than another.”

Is o3-pro Just a Specialized Tool?

An expert named Brian Jackson, a principal research director at Info-Tech Research Group, chimed in on these findings. He wasn’t too surprised by the results of this specific test.

He explained that OpenAI itself tells us that GPT-4o is their model that’s optimized for cost and is generally good for most tasks. Reasoning models like o3-pro, he said, are more suited for very specific and complex tasks, like coding or intricate problem-solving that requires deep, step-by-step thinking.

John: “Think of it like this, Lila: GPT-4o is like a versatile Swiss Army knife – it’s really good for lots of everyday things. O3-pro, on the other hand, might be more like a super-specialized surgeon’s scalpel. You wouldn’t use a scalpel to open a can of beans, right? It would be overkill, inefficient, and probably not do a great job. But for a delicate, complex surgery, that scalpel is irreplaceable.”

Lila: “Oh, I get it! So, using o3-pro for choosing insurance policies in this test might have been like using that super-specialized scalpel when a good, sharp pair of scissors (GPT-4o) would have done the job better, faster, and much cheaper?”

John: “Exactly, Lila! You’ve nailed the analogy. Mr. Jackson pointed out that reasoning models, like those in the o3 family, often come out on top in other kinds of tests – particularly benchmarks designed to measure intelligence ‘in terms of breadth and depth.’ So, it’s not that o3-pro is ‘bad,’ it just might not have been the right tool for this particular language-oriented task of comparing insurance policies. The researchers themselves said developers shouldn’t just take vendor claims as absolute truth and immediately switch to the newest model without testing.”

Choosing the Right AI for the Right Job

This whole situation highlights a super important point for anyone looking to use AI: choosing the right Large Language Model (or LLM, which is the general term for these kinds of AI brains) is absolutely key.

Lila: “John, you just used ‘LLM’. I think I know it means ‘Large Language Model’ from our previous chats, but can you remind us what they do?”

John: “You got it, Lila! A Large Language Model (LLM) is an AI that’s been trained on a massive amount of text and data. Think of it like an AI that has read millions, if not billions, of web pages, books, and articles. This allows it to understand, generate, and work with human language in a very sophisticated way – like answering questions, writing stories, summarizing text, or, as we saw in the experiment, comparing insurance policies.”

Mr. Jackson mentioned that developers often work in environments where they can easily test out the same question or task on several different AI models at once to see which one gives the best output. For example, a platform like Amazon Bedrock lets users do this. It’s like having a panel of different experts and picking the one whose advice makes the most sense and works best for your specific problem. They might even design their applications to call upon one type of LLM for certain kinds of queries, and another model for other queries, depending on what works best.

When developers choose an LLM, they’re trying to balance a few key things:

Quality aspects: This includes how quickly the AI responds (that’s latency), how accurate its answers are, and even the ‘sentiment’ or tone of its responses (is it helpful, neutral, etc.?).
Cost: How much will it cost to run, especially if it’s going to be used a lot (what they call ‘scaling’ – will it get 1,000 queries a day, or a million?). They need to avoid “bill shock” while still delivering good results.
Security and Privacy: Is the AI safe to use? Will it protect sensitive information appropriately?

Lila: “You mentioned ‘latency’ again there, John. Can you explain that a bit more?”

John: “Good catch, Lila! ‘Latency’ is basically the delay or waiting time. It’s how long you have to wait for the AI to process your request and give you an answer. If you ask an AI a question and it takes a whole minute to reply, that’s high latency. If it answers almost instantly, that’s low latency. For many applications, especially those where users are interacting in real-time (like a chatbot), low latency is really important for a good user experience.”

Mr. Jackson also noted that developers typically use ‘agile methodologies’ – which means they build and test their work constantly, looking at user experience, quality, and cost, making improvements as they go.

Key Takeaways: What We Can Learn

So, what’s the big lesson from all this for folks like us, or for developers building AI tools?

Mr. Jackson offers some great advice:

Don’t just believe the hype: Just because a company says their new AI is the ‘latest and greatest’ doesn’t mean it’s automatically the best choice for every single task. As the SplxAI researchers showed, it’s important to test and see for yourself.
Think of LLMs like a commodity market: He suggests viewing LLMs as a market where there are lots of options that can often be interchangeable. It’s like choosing between different brands of a product – you pick the one that best fits your needs, performance requirements, and budget for a specific use case.
Focus on what users need: Ultimately, the goal is to create something that people find useful and satisfying. The ‘best’ AI is the one that helps achieve that for the user.
Test, test, and test again: Developers are always testing their work, looking at user experience, the quality of what the AI produces, and the costs involved. This helps them fine-tune things and make sure they’re on the right track.

John and Lila’s Quick Thoughts

Well, this was quite a deep dive, wasn’t it?

John: For me, this really drives home the point that in the fast-moving world of AI, ‘newest’ or ‘most advanced’ doesn’t always automatically mean ‘best for everything.’ It’s about finding the right fit, like choosing the right tool from a well-stocked toolbox. Sometimes the fancy new gadget is perfect for a very specific job, but other times a more general, reliable tool does the everyday tasks just fine, or even better and more efficiently!

Lila: I agree, John! It makes a lot of sense. Before, I might have just assumed that a model labeled ‘pro’ or ‘advanced reasoning’ would automatically be better at everything. But now I see that different AIs are like different specialists – you go to a heart specialist for your heart, not for a sprained ankle! It’s cool to learn how developers and researchers figure out which AI is the best pick for a particular job, and that testing is so important.

This article is based on the following original source, summarized from the author’s perspective:
o3-pro may be OpenAI’s most advanced commercial offering,
but GPT-4o bests it

GPT-4o vs. o3-pro: Is OpenAI’s “Most Advanced” LLM Overhyped?

Understanding AI “Thinkers”: What are Reasoning Models?

The Big Experiment: Putting AI to the Test

The Results Are In: A Surprising Outcome?

Is o3-pro Just a Specialized Tool?

Choosing the Right AI for the Right Job

Key Takeaways: What We Can Learn

John and Lila’s Quick Thoughts

Related Posts

Leave a Reply Cancel reply

Our Mission

Design. Strategy. Brand.

About Us