
AI Just Got Destroyed by Super Mario—Claude Wins, GPT-4o Fails Instantly!

Are Video Games the Ultimate AI Benchmark?

For years, AI model evaluation has relied on benchmarks like MMLU, Chatbot Arena, and SWE-Bench Verified. But as AI evolves, even experts like Andrej Karpathy question whether these methods still hold up.

His latest concern? AI benchmarks are losing their reliability—and the answer to better evaluation may lie in video games.

After all, AI has a long history in gaming. DeepMind's AlphaGo changed the world by defeating Go champions. OpenAI Five dominated Dota 2, proving AI could outthink human players in strategy games.


Now, researchers at UC San Diego’s Hao AI Lab have taken things further. They built an open-source “gaming agent” AI to test large language models (LLMs) in real-time puzzle-solving games—starting with Super Mario Bros.

The results? Claude 3.7 played for a full 90 seconds—crushing OpenAI’s GPT-4o, which died almost immediately.

Claude 3.7 Outperforms OpenAI and Google in Super Mario

The GamingAgent project, released as open source, lets AI models control a game character using natural language.
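
To give a rough picture of how an agent like this can work, here is a minimal Python sketch of a screenshot-to-keypress loop. It is not GamingAgent's actual code: query_llm, capture_frame, press_key, and the key mapping are hypothetical placeholders you would wire up to a real model API and emulator.

```python
import base64
import time

# Assumed mapping from the model's natural-language replies to emulator keys.
ACTION_KEYS = {"move right": "d", "jump": "w", "stay": None}

def query_llm(prompt: str, screenshot_b64: str) -> str:
    """Placeholder for a vision-capable LLM call (e.g. Claude or GPT-4o)."""
    raise NotImplementedError("connect your model provider here")

def agent_loop(capture_frame, press_key, seconds: float = 90.0) -> None:
    """Capture a frame, ask the model for one action, translate it to a key press."""
    prompt = (
        "You are playing Super Mario Bros. Look at the screenshot and reply with "
        "exactly one action: 'move right', 'jump', or 'stay'."
    )
    deadline = time.time() + seconds
    while time.time() < deadline:
        frame_b64 = base64.b64encode(capture_frame()).decode()
        action = query_llm(prompt, frame_b64).strip().lower()
        key = ACTION_KEYS.get(action)
        if key is not None:
            press_key(key)
        time.sleep(0.2)  # crude pacing; a real agent also has to absorb LLM latency
```

The pacing matters: in a real-time game the loop must keep issuing actions even while the model is still thinking, so how an agent handles that delay shapes how "good at the game" it looks.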

🔹 Claude 3.7 Sonnet lasted an impressive 90 seconds.
🔹 GPT-4o died in just 20 seconds—defeated by the first enemy!
🔹 Google’s Gemini 1.5 Pro and Gemini 2.0 performed poorly, struggling with basic movement.

GPT-4o: The AI That Can’t Jump

Imagine a player so bad at gaming that they die within seconds. That’s GPT-4o in Super Mario.

  • 💀 First attempt: Killed by the first enemy, like a total beginner.
  • 💀 Second attempt: Barely made progress, stopping every two steps.
  • 💀 Third attempt: Got stuck under a pipe for 10 seconds before eventually dying.

For an AI model that boasts advanced reasoning, GPT-4o’s performance was shockingly bad.

Google’s Gemini: The “Two-Step Jump” Strategy

Gemini 1.5 Pro also failed right at the first enemy. But in its second attempt, it showed some improvement—hitting a ?-block and grabbing a Super Mushroom.

However, it developed a weird habit: jumping every two steps—whether necessary or not.

  • 🚀 Jumped 9 times in a short distance
  • 🚀 Hopped over pipes, ground, and even empty spaces
  • 🚀 Made it further than GPT-4o but still fell into a pit

Gemini 2.0 Flash was slightly better, jumping more smoothly and reaching a higher platform. But it still failed to escape a pit near the fourth pipe—ending its game.

Claude 3.7: The Super Mario Prodigy?

Unlike OpenAI and Google’s models, Claude 3.7 played like an actual gamer.

  • Only jumped when necessary (to avoid obstacles or gaps)
  • Avoided enemies by precise jumps
  • Discovered a hidden star power-up!
  • Reached the furthest point of any model tested

Claude 3.7 even outplayed Gemini 2.0 Flash, which previously held the best record. While Gemini failed at a pit, Claude not only cleared it but also grabbed extra coins and engaged new enemies like Koopa Troopas.

Can AI Master More Complex Games?

Mario isn’t the only test. The researchers also evaluated AI models on Tetris and 2048, two classic puzzle games requiring strategic decision-making.

GPT-4o Fails 2048—Claude 3.7 Performs Better

In the number-merging puzzle 2048, the AI needed to slide tiles and make strategic moves.
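
For context on what that involves, here is a minimal, illustrative slide-and-merge function for a single 2048 row; it is not the benchmark's code, just a sketch of the move the model has to reason about.

```python
def slide_left(row: list[int]) -> list[int]:
    """Slide one 2048 row to the left, merging each equal pair once."""
    tiles = [t for t in row if t != 0]               # drop empty cells
    merged: list[int] = []
    i = 0
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            merged.append(tiles[i] * 2)              # merge a matching pair
            i += 2
        else:
            merged.append(tiles[i])
            i += 1
    return merged + [0] * (len(row) - len(merged))   # pad back to row length

# Example: [2, 2, 4, 0] -> [4, 4, 0, 0]. The model's job is to pick the
# direction (left/right/up/down) whose merges keep the board from filling up.
print(slide_left([2, 2, 4, 0]))
```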

🔹 GPT-4o failed early, overthinking moves.
🔹 Claude 3.7 lasted longer, making smarter tile merges.
🔹 Neither model won—but Claude outperformed GPT-4o.

Claude 3.7’s Tetris Performance Impresses Experts

When tested in Tetris, Claude 3.7 showed:

  • A decent strategy for stacking pieces
  • Proper line clearing (sketched below)
  • Longer survival than the other AI models
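
As a rough illustration of the line-clear rule Claude had to plan around, here is a short sketch; again, this is illustrative Python, not the benchmark's implementation.

```python
def clear_full_rows(board: list[list[int]]) -> tuple[list[list[int]], int]:
    """Remove completed Tetris rows and refill from the top with empty rows."""
    width = len(board[0])
    kept = [row for row in board if 0 in row]        # any row with a gap survives
    cleared = len(board) - len(kept)
    empty_rows = [[0] * width for _ in range(cleared)]
    return empty_rows + kept, cleared

# Example: a 3x4 well whose bottom row is full clears exactly one row.
board = [
    [0, 0, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 1],
]
new_board, rows_cleared = clear_full_rows(board)
print(rows_cleared)  # 1
```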

Anthropic’s Alex Albert praised the experiment, saying:

“We should turn every video game into an AI benchmark!”

Are Games the Future of AI Evaluation?

The results suggest that video games could be the next big benchmark for AI. Unlike traditional tests, games require real-time decision-making, adaptability, and fine-grained control.

With AI models advancing rapidly, static benchmarks may no longer be enough to judge true intelligence. If gaming proves to be a better measure, we may see AI models training with reinforcement learning on thousands of games before deployment.

Final Thoughts: Claude Wins This Round, But What’s Next?

Claude 3.7’s superior gameplay hints at stronger reasoning and adaptability compared to GPT-4o and Gemini. But as AI evolves, which model will be the first to beat a full game like a human?

With open-source gaming agents available, expect more AI vs. video game battles soon. Who knows? Maybe one day, AI will complete Super Mario with zero mistakes—or even beat human esports players.

Until then, Claude remains the king of AI gaming—while GPT-4o needs some serious practice!
