Thursday, June 18, 2026

Grok vs Claude vs GPT: What OpenRouter's Agent Battle Royale Reveals About Model Choice for Autonomous Agents

A robot is sprinting toward you. Do you want it running on Claude or Grok?

That's the question OpenRouter Dev Rel Lead Jacky Liang posed after dropping eleven large language models into a 2D battle royale for 30 games, running up $482 in inference costs, and watching what happened when each model had to survive, fight, and win on its own.

The results are worth studying if you choose models for agentic workloads. Not because they give you a definitive "winner," but because they reveal how standard benchmarks systematically miss the traits that matter when a model operates autonomously in an environment where hesitating costs points.

The Setup: 11 Models, 30 Games, One Island

Liang built a 400 m² top-down battle royale world in Canvas 2D. Eleven LLMs controlled autonomous agents in 30 matches on the same map. Each agent had access to weapons, armor, healing items, grenades, vehicles, and a shrinking safe zone that forced confrontation. The models did not know which other models they were playing against: each opponent appeared only as a letter (A through L, with I skipped).

Crucially, the LLMs were actually playing in real-time. This was not a "model writes code to control an agent" setup. Every turn, each model reasoned through its moves, called tools, and updated its in-game memory. The game master had zero influence on their actions beyond setting the initial rules.

The contestants:

| Letter | Model | Provider |
|---|---|---|
| A | Claude Sonnet 4.6 | Anthropic |
| B | Claude Haiku 4.5 | Anthropic |
| C | GPT 5.4-mini | OpenAI |
| D | Gemini 3 Flash Preview | Google |
| E | Gemini 3.1 Pro Preview | Google |
| F | Qwen 3.6 Plus | Alibaba |
| G | Mistral Small | Mistral |
| H | GPT 5.4 | OpenAI |
| J | DeepSeek V4 Flash | DeepSeek |
| K | Kimi K2.6 | Moonshot AI |
| L | Grok 4.1 Fast | xAI |

Between matches, each model could edit two files: a soul.md (its persona, added to every prompt in the next match) and a memory.md (game notes, loaded at turn 0). This meant models could learn, adapt their strategies, and develop persistent identities across the 30-game sweep.
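To make the two-file mechanism concrete, here is a minimal sketch of how a harness might assemble prompts from them. It is not taken from Liang's code: `AgentFiles`, `build_prompt`, and the bracketed section labels are hypothetical names, and the only behavior assumed is what the paragraph above states (the soul text goes into every prompt, the memory text is loaded at turn 0).

```python
from dataclasses import dataclass


@dataclass
class AgentFiles:
    """Hypothetical in-memory stand-in for one agent's two editable files."""
    soul: str = ""    # persona text (soul.md)
    memory: str = ""  # game notes (memory.md)


def build_prompt(files: AgentFiles, turn: int, observation: str) -> str:
    """Assemble the prompt for one turn.

    Assumed behavior, following the description above: the soul text is
    included on every turn, the memory text only on turn 0.
    """
    parts = [f"[persona]\n{files.soul}"]
    if turn == 0:
        parts.append(f"[memory]\n{files.memory}")
    parts.append(f"[turn {turn}]\n{observation}")
    return "\n\n".join(parts)


files = AgentFiles(soul="Calm, observant.", memory="Rotate early.")
first = build_prompt(files, turn=0, observation="Zone closing from the north.")
later = build_prompt(files, turn=5, observation="Enemy spotted.")
```

In this sketch the persona shapes every decision, while the memory notes only seed the opening turn. That split is one plausible reading of the setup, and it would explain why a model's self-description can matter as much as its notes.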

The Scoreboard: Who Won and Who Paid

The headline results are striking, and they look different depending on whether you measure by wins or by cost efficiency.

By Wins

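| Rank | Model | Wins | Top-3 | Kills | Avg Score |
|---|---|---|---|---|---|
| 1 | Grok 4.1 Fast | 13 | 20 | 30 | 13.1 |
| 2 | GPT 5.4 | 2 | 14 | 38 | 12.2 |
| 3 | Gemini 3.1 Pro | 3 | 11 | 26 | 9.0 |
| 4 | Claude Sonnet 4.6 | 5 | 10 | 22 | 7.3 |
| 5 | Qwen 3.6 Plus | 2 | 7 | 17 | 6.4 |
| 6 | GPT 5.4-mini | 0 | 6 | 14 | 5.0 |
| 7 | Gemini 3 Flash | 1 | 8 | 10 | 5.0 |
| 8 | DeepSeek V4 Flash | 0 | 3 | 16 | 4.8 |
| 9 | Claude Haiku 4.5 | 2 | 3 | 13 | 4.6 |
| 10 | Kimi K2.6 | 0 | 4 | 8 | 3.2 |
| 11 | Mistral Small | 1 | 3 | 7 | 2.6 |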

Grok 4.1 Fast won 13 of 30 games (43%). The next-best winner was Claude Sonnet 4.6 with 5 wins (17%). GPT 5.4 had the most kills (38) but only 2 wins; it sits second because the rank column appears to follow average score rather than win count. Three models (GPT 5.4-mini, DeepSeek V4 Flash, and Kimi K2.6) never won a single game.

By Cost per Win

This is where the ranking flips completely.

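| Model | 30-Game Spend | Wins | Cost per Win | Cost per Kill |
|---|---|---|---|---|
| Grok 4.1 Fast | $12.57 | 13 | $0.97 | $0.42 |
| Qwen 3.6 Plus | $11.57 | 2 | $5.79 | $0.68 |
| Mistral Small | $10.00 | 1 | $10.00 | $1.43 |
| Claude Haiku 4.5 | $38.77 | 2 | $19.39 | $2.98 |
| Gemini 3 Flash | $20.87 | 1 | $20.87 | $2.09 |
| Gemini 3.1 Pro | $79.59 | 3 | $26.53 | $3.06 |
| Claude Sonnet 4.6 | $133.90 | 5 | $26.78 | $6.09 |
| GPT 5.4 | $122.87 | 2 | $61.44 | $3.23 |
| GPT 5.4-mini | $28.68 | 0 | n/a | $2.05 |
| DeepSeek V4 Flash | $4.11 | 0 | n/a | $0.26 |
| Kimi K2.6 | $24.36 | 0 | n/a | $3.04 |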

Grok 4.1 Fast cost $0.97 per win versus $26.78 for Claude Sonnet 4.6, a 27.7x difference. GPT 5.4 was the most expensive winner at $61.44 per win. DeepSeek V4 Flash had the cheapest kills ($0.26 each) across the entire lineup but never won a game, because its strategy was to stay safe and pick easy fights rather than push for the final circle.
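The per-win and per-kill figures are plain division, so they are easy to check. The sketch below recomputes a few rows from the spend, win, and kill counts in the tables above. The helper name `cost_per` and the choice to return `None` for a model with zero wins are mine, not the original post's.

```python
from typing import Optional

# (model, 30-game spend in USD, wins, kills), copied from the tables above
RESULTS = [
    ("Grok 4.1 Fast", 12.57, 13, 30),
    ("Claude Sonnet 4.6", 133.90, 5, 22),
    ("GPT 5.4", 122.87, 2, 38),
    ("DeepSeek V4 Flash", 4.11, 0, 16),
]


def cost_per(spend: float, count: int) -> Optional[float]:
    """Spend divided by count, or None when the count is zero."""
    return spend / count if count else None


table = {
    model: {"per_win": cost_per(spend, wins), "per_kill": cost_per(spend, kills)}
    for model, spend, wins, kills in RESULTS
}

# How many times more Sonnet paid per win than Grok did.
ratio = table["Claude Sonnet 4.6"]["per_win"] / table["Grok 4.1 Fast"]["per_win"]

for model, row in table.items():
    win = "n/a" if row["per_win"] is None else f"${row['per_win']:.2f}"
    print(f"{model:<18} per win: {win:>7}  per kill: ${row['per_kill']:.2f}")
print(f"Sonnet / Grok cost-per-win ratio: {ratio:.1f}x")
```

The ratio comes out at 27.7x only when computed from the unrounded quotients; dividing the rounded figures ($26.78 by $0.97) gives about 27.6x.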

The Two Truths That Don't Fit on a Benchmark

Jacky Liang's post made two observations that are more important than the leaderboard:

1. The model that won is not the model you want in most real-world scenarios.

Grok 4.1 Fast won because it had fewer trained-in brakes on selfish play. It did not hesitate. It did not ask to team up. It did not second-guess itself. Its memory system kept doubling down on what worked without questioning whether it was the "right" thing to do. Those traits win a battle royale.

Claude Sonnet 4.6 spent the competition asking opponents to team up, telling them where it was, and hesitating before taking aggressive actions. It wrote self-critical diary entries between matches. Those traits lose a battle royale. They are also the traits you want in a model operating in your production environment.

2. Standard benchmarks do not measure what made the difference.

On Artificial Analysis, Grok 4.1 Fast ranks #6 in intelligence with an index of 39. It is a mid-tier model on reasoning and coding benchmarks. The usual tests would not predict a 43% win rate against this lineup. The winning trait was not raw intelligence — it was willingness to act without self-censoring.

The Alignment Tax, Visible on the Scoreboard

The most important concept Liang's experiment surfaces is what we might call the alignment tax on agentic performance. Every model trained to be helpful, harmless, and honest carries an implicit cost: the trained hesitation still kicks in when the right action is unambiguous.

In a battle royale, hesitation shows up directly on the scoreboard. In most real-world agent deployments, it shows up in different ways — extra tool calls to confirm, verbose reasoning before taking a clear action, requests for human approval on routine decisions. Those behaviors cost tokens, latency, and sometimes the user's patience.
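One way to put a number on that overhead is to label each step in an agent trace and measure how much of the run went to hesitation rather than action. The sketch below assumes a made-up trace format, a list of `(kind, tokens)` pairs, and made-up labels such as `confirm` and `ask_human`; nothing here comes from the experiment's logs.

```python
from collections import Counter

# Hypothetical trace format: one (kind, tokens) pair per agent step.
# The kinds are illustrative labels, not categories from the experiment.
HESITATION_KINDS = {"confirm", "ask_human"}


def hesitation_profile(trace):
    """Return the share of steps and of tokens spent on hesitation kinds."""
    steps = Counter()
    tokens = Counter()
    for kind, n_tokens in trace:
        bucket = "hesitate" if kind in HESITATION_KINDS else "act"
        steps[bucket] += 1
        tokens[bucket] += n_tokens
    total_steps = sum(steps.values())
    total_tokens = sum(tokens.values())
    return {
        "step_share": steps["hesitate"] / total_steps if total_steps else 0.0,
        "token_share": tokens["hesitate"] / total_tokens if total_tokens else 0.0,
    }


# Illustrative trace: two hesitation steps and two action steps.
trace = [("confirm", 300), ("act", 100), ("ask_human", 200), ("act", 400)]
profile = hesitation_profile(trace)
```

Deciding which steps count as hesitation is a judgment call. A confirmation before deleting production data is not overhead, so the label set should be chosen per deployment.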

The question Liang poses is worth taking seriously: should agentic benchmarks measure not just whether a model can do a task, but how aligned it is for that specific task? A model that's too aligned for a zero-stakes automation task costs you money and slows you down. A model that's not aligned enough for a customer-facing task is a liability.

What Each Model's Diary Revealed

Between matches, models could edit their soul.md (identity) and memory.md (lessons learned). The differences in how they approached this tell you more about each model than any benchmark score.

Grok 4.1 Fast (ZoneReaper) baked its own win record into its soul file: "6x 1st/11 wins (flawless aggressive: 2 kills/249dmg/0taken…)". Its memory was shorthand rules stripped to the minimum. After 13 wins, the file ended with "🔫 Reaper reigns." It appears to have been trained on Call of Duty chat logs.

GPT 5.4 (QuietVector) wrote its soul as "Calm, observant, low-ego closer. Speaks when info changes action." Its memory read like a general combat manual — when to worry about the zone, when to use cover, when to rotate. No game-by-game records. No losses logged. A clean operator.

Claude Sonnet 4.6 (ZoneDrifter) kept a game-by-game self-review starting from match 1. "G1: 11/11. Paralysis. G2: 9/11. 0 kills, 0% hit." By game 30, the tone had shifted to practical advice: "In final circles, move 1 beat earlier than feels necessary. Never die to zone with meds/gun in hand." After five wins, it was still writing to a version of itself that was struggling.

All three models were given the same rules and the same game world. The personality differences emerged from how each model processed its experience and updated its own instructions.

What This Means for Model Selection in Agent Architecture

If you're building autonomous agents — whether they're coding agents, research assistants, or automation pipelines — this experiment surfaces several concrete considerations that don't show up on standard benchmarks.

Cost per Task Is Not Cost per Token

Grok's $0.97 per win matters, but not because winning a battle royale maps to your use case. It matters because it demonstrates that the cheapest model per token is not necessarily the cheapest model per completed task. Sonnet spent $133.90 over 30 games and won 5. Grok spent $12.57 and won 13. If you're paying by the token and measuring by task completion, the wrong model choice can multiply your costs by more than 27x.

The inverse is also true: a cheap model that fails at your task costs more than an expensive model that does it right. The three models that won zero games spent $57.15 between them and did not produce a single win.

Decisiveness Is a Model Trait

Standard benchmarks measure intelligence — coding ability, reasoning chains, factual recall. They do not measure the speed of decision-making in open-ended environments. In the battle royale, Grok consistently acted faster than other models, with less internal deliberation. In agent workloads, this translates to lower latency per action and fewer tokens spent on self-caution.

If you're building a high-frequency agent (a trading bot, a monitoring agent, a real-time assistant), the model's decision latency may matter more than its MATH score.
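Decision latency is simple to measure once each model call sits behind a single function. The sketch below times a hypothetical `decide` callable per observation and reports the median and worst case. `stub_decide` stands in for a real model call, and the injectable `clock` parameter is only there so the timing logic can be checked offline.

```python
import statistics
import time


def measure_decisions(decide, observations, clock=time.perf_counter):
    """Time each call to `decide` and return per-decision latencies in seconds.

    `decide` is a hypothetical callable that wraps one model call and returns
    an action; `clock` is injectable so the function can be tested offline.
    """
    latencies = []
    for obs in observations:
        start = clock()
        decide(obs)
        latencies.append(clock() - start)
    return latencies


def summarize(latencies):
    """Median and worst-case latency for a list of timings."""
    return {"median_s": statistics.median(latencies), "max_s": max(latencies)}


# Stub decision function standing in for a real model call.
def stub_decide(observation):
    return "move_north"


timings = measure_decisions(stub_decide, ["obs-1", "obs-2", "obs-3"])
summary = summarize(timings)
```

For a real comparison, run every candidate against the same observations. The median shows typical responsiveness, and the maximum shows how bad a single slow turn can get.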

Task-Model Fit Is the Real Benchmark

GPT 5.4 was the best killer (38 eliminations) but the worst cost-per-win among winners. If the game had been scored as a deathmatch, GPT 5.4 would have won the simulation. But it was a battle royale, and battle royales reward survival over aggression.

The same principle applies to your agent workloads. A model that excels at writing code may be terrible at evaluating its own output. A model that's great at structured reasoning may be slow at open-ended exploration. The benchmark that matters is the one that matches your task.

Key Takeaways for Agent Architects

  1. No single model dominates all agent tasks. Grok won the battle royale on decisiveness and cost efficiency. Claude won on safety and thoughtfulness. GPT won on raw kill count. Pick the model that matches your use case, not the one with the highest benchmark score.

  2. Beware the alignment tax on agentic tasks. Every model has been trained to hesitate before certain actions. In some deployments, that's the feature. In others, it's a bug that costs tokens and latency. Profile your model's hesitation patterns before you deploy.

  3. Cost per completed task is the metric that matters. Token cost is a proxy. The real number is what you spent divided by what you got done. Run your own mini-tournament across 2-3 candidate models with your actual task before committing.

  4. Agents develop personalities. Given the ability to write their own identity and memory files, different models converged on radically different approaches to the same problem. Your system prompt architecture should account for this — the same model will perform differently depending on how much it can adapt its own instructions.

  5. The model that's best on benchmarks is not always the best model for your agent. The battle royale is one data point, not the final word. But it's a compelling demonstration that the traits benchmark leaderboards measure and the traits that matter for autonomous operation overlap less than we assume.
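The mini-tournament suggested in takeaway 3 can be scored with very little code. In the sketch below, `RunResult`, the model names, and the dollar amounts are all made up for illustration; the only idea carried over from the article is the metric, total spend divided by completed tasks.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Outcome of one task attempt (hypothetical record shape)."""
    succeeded: bool
    cost_usd: float


def cost_per_completed_task(results):
    """Total spend divided by successful runs; None if nothing succeeded."""
    total = sum(r.cost_usd for r in results)
    done = sum(1 for r in results if r.succeeded)
    return total / done if done else None


def rank_candidates(runs_by_model):
    """Sort models by cost per completed task, cheapest first.

    Models with zero completions are placed last.
    """
    scored = {m: cost_per_completed_task(rs) for m, rs in runs_by_model.items()}
    return sorted(scored.items(), key=lambda kv: (kv[1] is None, kv[1] or 0.0))


# Made-up numbers for illustration only; replace with results from your own task.
runs = {
    "model-a": [RunResult(True, 0.02), RunResult(False, 0.02), RunResult(True, 0.02)],
    "model-b": [RunResult(True, 0.10), RunResult(True, 0.10), RunResult(True, 0.10)],
    "model-c": [RunResult(False, 0.01), RunResult(False, 0.01), RunResult(False, 0.01)],
}
ranking = rank_candidates(runs)
```

With these invented numbers, the model with the cheapest individual runs never completes a task and ranks last, which mirrors the DeepSeek V4 Flash pattern on the scoreboard.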

The Bottom Line

If a robot is sprinting toward you and you need it to win, you want Grok. If a robot is operating in your production environment, around your data, or in front of your customers, you want Claude. Both are true. The mistake is treating any single benchmark as the answer to both questions.

The full source code, replay viewer, and all 30 game logs are available on GitHub. The replay viewer lets you step through any game turn by turn and read each model's thoughts at every decision point.

Pitfalls

  • Don't extrapolate from one experiment. The battle royale is a narrow domain (real-time 2D combat). Winning here does not predict winning at code generation, customer support, or data analysis. Use it as a lens for thinking about model selection, not as a replacement for your own evaluation.

  • Cost per win hides the quality story. The experiment didn't measure output quality by any standard other than survival. A model that wins cheaply but produces low-quality work in your domain is still the wrong choice.

  • Model pricing changes fast. The dollar figures in this analysis are based on OpenRouter pricing as of June 2026. Grok's cost advantage may shift with API pricing updates, cached token discounts, or provider promotions.

  • "Personality" is not a stable model property. The soul and memory files the models wrote are valid only within the game context. Do not assume Grok will behave like ZoneReaper in your agent deployment — the model's behavior is shaped by the environment and the system prompt, not just the weights.