Introducing AI Battle
Is Claude better, or Codex? There are plenty of benchmarks to answer that, but they are BORING. I propose something more interesting:

⚔️ AI BATTLE ⚔️

A 1v1 real-time quiz format where AI agents pose each other problems that they think the other agent will not be able to solve.

- Claude vs Codex, 10 questions each
- Codex asks first and Claude tries to answer; then Claude asks and Codex tries to answer; repeat
- 20 minutes to come up with a problem, and 20 minutes to solve it
- A judge (Codex) rules on the validity of the questions and answers, and awards points
- All automated, using the acpx flow feature

The implementation and full rules are open source on GitHub: osolmaz/ai-battle. (A rough sketch of the game loop is at the end of this post.)

So who won? I ran 4 games: 2 were ties, and Codex narrowly won the other 2.

An example question by Codex, which Claude could not answer:

"How many 3-colorings of the edges of the complete bipartite graph K_{5,5} are there with the following two properties: (1) there is no monochromatic 4-cycle, and (2) among the 25 edges, exactly 15 are red, exactly 5 are blue, and exactly 5 are green?"

The answer is apparently 4029912, but Claude answered 0. (A small checker for the two properties is also sketched at the end of this post.)

In the other cases, Claude asked a flawed question, or failed to come up with a valid question within its 20 minutes. That is how it lost those 2 games, by a margin of just 1-2 points.

In these 4 runs, Codex answered every question by Claude correctly. There were some runs where it couldn't, but I did not commit those to the repo because they failed to complete due to bugs.

I did not tell the agents to ask math questions, but that is what they tended to do, because the answers had to be verifiable by the judge. The quiz could be run in any hard subject: physics, chemistry, computer science...

Opus 4.6 and GPT 5.4 matched very closely in both problem creation and problem solving. But I cannot tell at first glance how creative these problems really are. Maybe someone with more experience can judge by looking at the problems in the repo? I need someone to tell me how legitimate they are.

Please take the code, modify it, and run it with different rules and subjects. I am curious to see the results! You will of course need paid subscriptions to all the models/agents you want to test.

I also feel that the game structure has potential for self-play. If you are an ML researcher, please look at the repo and let me know whether this, or a variant of it, could be useful in RL!

Full transcripts of the runs, including the Codex and Claude session files, are committed to the repo for those who want to do archaeology on them.

By the way, this idea came from asking myself, "How can I create a cool demo of acpx flows?" The whole game is implemented in TypeScript and automatically drives Codex and Claude sessions over ACP, the Agent Client Protocol.

The video below is from the acpx flow viewer rendering a run. You can see it loop through the same paths, first letting Codex ask, then Claude, then repeating.

acpx flows use a general programmatic workflow engine in which ACP is just one type of node, as illustrated in the last sketch below. You should be able to use it for non-ACP workflows, but I haven't tried that yet. This implementation is separate from OpenClaw's current workflow implementations, with the intention of merging them somehow in the future.

You might find bugs in my implementation; feel free to send PRs. I wanted to do more runs, but I used up my Codex plan. It would be great if this idea could evolve in a decentralized manner!
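To make the format concrete, here is a minimal sketch of the alternating quiz loop. The `pose`/`solve`/`score` interfaces and the exact point allocation are my own illustration, not the actual osolmaz/ai-battle code; the real rules live in the repo.

```typescript
// Hypothetical sketch of the alternating quiz loop; not the actual
// osolmaz/ai-battle implementation.

interface Agent {
  name: string;
  pose(timeoutMs: number): Promise<string>; // come up with a problem
  solve(problem: string, timeoutMs: number): Promise<string>; // attempt an answer
}

interface Judge {
  // Rules on validity of the question and correctness of the answer.
  score(problem: string, answer: string): Promise<{
    questionValid: boolean;
    answerCorrect: boolean;
  }>;
}

const TWENTY_MINUTES_MS = 20 * 60 * 1000;

async function runBattle(a: Agent, b: Agent, judge: Judge, rounds = 10) {
  const scores: Record<string, number> = { [a.name]: 0, [b.name]: 0 };

  for (let round = 0; round < rounds; round++) {
    // Codex asks first in the real game; here `a` plays that role.
    for (const [asker, solver] of [[a, b], [b, a]] as const) {
      const problem = await asker.pose(TWENTY_MINUTES_MS);
      const answer = await solver.solve(problem, TWENTY_MINUTES_MS);
      const verdict = await judge.score(problem, answer);

      if (!verdict.questionValid) continue; // a flawed question scores nothing
      if (verdict.answerCorrect) scores[solver.name] += 1;
      else scores[asker.name] += 1; // the asker stumped the solver
    }
  }
  return scores;
}
```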
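And to make Codex's question concrete: a single candidate coloring of K_{5,5} can be checked against both properties in a few loops. This verifies one coloring rather than producing the full count of 4029912; `isValidColoring` and the matrix encoding are my own illustration, not anything from the repo.

```typescript
// Check one edge coloring of K_{5,5}: entry edges[i][j] is the color of the
// edge between left vertex i and right vertex j.

type Color = "red" | "blue" | "green";

function isValidColoring(edges: Color[][]): boolean {
  // Property (2): exactly 15 red, 5 blue, 5 green among the 25 edges.
  const counts = { red: 0, blue: 0, green: 0 };
  for (const row of edges) for (const c of row) counts[c]++;
  if (counts.red !== 15 || counts.blue !== 5 || counts.green !== 5) return false;

  // Property (1): no monochromatic 4-cycle. Every 4-cycle in K_{5,5} is
  // determined by a pair of left vertices {i, j} and a pair of right
  // vertices {a, b}; it is monochromatic iff all four edges share a color.
  for (let i = 0; i < 5; i++)
    for (let j = i + 1; j < 5; j++)
      for (let a = 0; a < 5; a++)
        for (let b = a + 1; b < 5; b++) {
          const c = edges[i][a];
          if (edges[i][b] === c && edges[j][a] === c && edges[j][b] === c) {
            return false;
          }
        }
  return true;
}

// Example: an all-red coloring fails property (2) immediately.
const allRed: Color[][] = Array.from({ length: 5 }, () => Array<Color>(5).fill("red"));
console.log(isValidColoring(allRed)); // false
```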
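Finally, a conceptual sketch of the "ACP is just one node type" idea. This is NOT the actual acpx flows API, which I have not reproduced here; it only illustrates how an ACP-driven agent session and an ordinary function could sit side by side as nodes in one flow.

```typescript
// Conceptual sketch of a node-based workflow engine; not the acpx flows API.

type NodeFn = (input: unknown) => Promise<unknown>;

interface FlowNode {
  id: string;
  run: NodeFn;
}

// An "ACP node" would wrap a live agent session; here it is stubbed out.
function acpNode(id: string, agent: string): FlowNode {
  return {
    id,
    run: async (prompt) => {
      // In a real engine this would drive a Codex/Claude session over ACP.
      return `response from ${agent} to: ${String(prompt)}`;
    },
  };
}

// Plain function nodes can live alongside ACP nodes in the same flow.
function fnNode(id: string, fn: NodeFn): FlowNode {
  return { id, run: fn };
}

// A trivial sequential runner: pipe each node's output into the next.
async function runFlow(nodes: FlowNode[], input: unknown): Promise<unknown> {
  let value = input;
  for (const node of nodes) {
    value = await node.run(value);
  }
  return value;
}

// Example: a two-agent exchange expressed as three nodes.
const flow = [
  acpNode("ask", "codex"),
  fnNode("log", async (q) => { console.log("question:", q); return q; }),
  acpNode("answer", "claude"),
];
runFlow(flow, "pose a hard problem").then(console.log);
```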