Is Claude better or Codex?
There are many benchmarks to answer that. But they are BORING
I propose something more interesting: ⚔️ AI BATTLE ⚔️
A 1v1 real-time quiz format where AI agents pose each other problems that they think the other agent will not be able to solve
Claude vs Codex
10 questions each
Codex asks first, Claude tries to answer
Then Claude asks and Codex tries to answer
Repeat
20 minutes to come up with a problem and 20 minutes to solve it
A judge (also Codex) evaluates the validity of the questions and answers, and awards points
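The rules above can be sketched as a simple loop. This is a hypothetical sketch only: askProblem, solveProblem, and judge stand in for the real ACP-driven agent calls, and I'm guessing at the exact point rules (see the repo for the real scoring):

```typescript
// Hypothetical sketch of the round loop: each round, Codex asks first and
// Claude answers, then roles swap. Not the actual ai-battle implementation.
type Agent = "codex" | "claude";

interface Scoreboard { codex: number; claude: number; }

async function runGame(
  askProblem: (asker: Agent) => Promise<string>,               // 20-minute budget
  solveProblem: (solver: Agent, q: string) => Promise<string>, // 20-minute budget
  judge: (q: string, a: string) => Promise<{ validQuestion: boolean; correct: boolean }>,
  rounds = 10, // "10 questions each"
): Promise<Scoreboard> {
  const score: Scoreboard = { codex: 0, claude: 0 };
  for (let i = 0; i < rounds; i++) {
    for (const [asker, solver] of [["codex", "claude"], ["claude", "codex"]] as [Agent, Agent][]) {
      const q = await askProblem(asker);
      const a = await solveProblem(solver, q);
      const verdict = await judge(q, a);
      if (!verdict.validQuestion) continue;    // flawed question: no points either way
      if (verdict.correct) score[solver] += 1; // solver answered correctly
      else score[asker] += 1;                  // asker stumped the solver
    }
  }
  return score;
}
```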
All automated, using the acpx flow feature
Implementation and full rules are all open source, on GitHub: osolmaz/ai-battle
So who won?
I ran 4 games.
2 were ties, and Codex won the other 2 by a narrow margin
An example question by Codex, which Claude could not answer:
How many 3-colorings of the edges of the complete bipartite graph K_{5,5} are there with the following two properties: (1) there is no monochromatic 4-cycle, and (2) among the 25 edges, exactly 15 are red, exactly 5 are blue, and exactly 5 are green?
The answer is apparently 4029912, but Claude answered 0
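Brute-forcing the K_{5,5} question is hopeless (3^25 colorings), but here's a sketch of the check on a tiny case, K_{2,2}, where the count is easy to verify by hand: K_{2,2} has exactly one 4-cycle, so 3^4 - 3 = 78 of the 81 colorings avoid a monochromatic 4-cycle (this sketch ignores the per-color edge counts from condition 2):

```typescript
// Brute-force count of 3-edge-colorings of K_{m,n} with no monochromatic
// 4-cycle. Only feasible for tiny graphs; K_{5,5} is far out of reach.
function countColorings(m: number, n: number): number {
  const edges = m * n; // edge (i, j) gets index i * n + j
  // Precompute all 4-cycles: two left vertices a < b, two right c < d.
  const cycles: number[][] = [];
  for (let a = 0; a < m; a++)
    for (let b = a + 1; b < m; b++)
      for (let c = 0; c < n; c++)
        for (let d = c + 1; d < n; d++)
          cycles.push([a * n + c, a * n + d, b * n + c, b * n + d]);
  let count = 0;
  for (let code = 0; code < 3 ** edges; code++) {
    // Decode base-3 code into per-edge colors.
    const color: number[] = [];
    let x = code;
    for (let e = 0; e < edges; e++) { color.push(x % 3); x = Math.floor(x / 3); }
    const mono = cycles.some(cy =>
      color[cy[0]] === color[cy[1]] &&
      color[cy[1]] === color[cy[2]] &&
      color[cy[2]] === color[cy[3]]);
    if (!mono) count++;
  }
  return count;
}

console.log(countColorings(2, 2)); // 78 = 3^4 - 3, matching the hand count
```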
In the other cases, Claude asked a flawed question and then failed to come up with a valid one within the 20 minutes. That's how it lost those 2 games, each by just a 1-2 point margin
In these 4 runs, Codex answered every one of Claude's questions correctly. There were other runs where it couldn't, but I did not commit those to the repo because they failed to complete due to bugs
I did not tell them to ask math questions, but that is what they tended to do, because the answers had to be verifiable by the judge. The quiz could be run on any hard subject: physics, chemistry, computer science...
Opus 4.6 and GPT 5.4 were very closely matched in both problem creation and problem solving. But I cannot tell at first glance how creative these problems really are. Maybe someone with more experience can look at the problems in the repo and tell me how legit they are?
Please take the code, modify it and run with different rules and subjects. I am curious to see the results!
You will need paid subscriptions to all the models/agents you want to test of course
I also feel that the game structure has potential for self-play. If you are an ML researcher, please look at the repo and let me know if this or a variant of it could be useful in RL!
Full transcripts of the runs, including Codex and Claude session files are committed to the repo, for those who want to do archaeology on them
Btw, this idea came from asking myself, "how can I create a cool demo of acpx flows?"
The whole game is implemented in TypeScript, and automatically drives Codex and Claude sessions over ACP, the Agent Client Protocol
The video below is the acpx flow viewer rendering a run. You can see it loop through the same paths, first letting Codex ask, then Claude, then repeating
acpx flows use a general programmatic workflow engine where ACP is just one node type. You should be able to use it for non-ACP workflows too, but I haven't tried that yet
This implementation is separate from OpenClaw's current workflow implementations, with the intention to merge them somehow in the future
You might find bugs in my implementation; feel free to send PRs. I wanted to do more runs, but I exhausted my Codex plan. It would be great if this idea could evolve in a decentralized manner!