Is Claude better or Codex?
There are many benchmarks to answer that. But they are BORING
I propose something more interesting: ⚔️ AI BATTLE ⚔️
A 1v1 real-time quiz format where AI agents try to pose each other problems that they think the other agent will not be able to solve
Claude vs Codex
10 questions each
Codex asks first, Claude tries to answer
Then Claude asks and Codex tries to answer
Repeat
20 minutes to come up with a problem and 20 minutes to solve it
Judge (Codex) judges the validity of the questions and answers, and gives points
All automated, using the acpx flows feature
Implementation and full rules are all open source, on GitHub at osolmaz/ai-battle
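The turn order above can be sketched as a couple of pure functions. This is only an illustration, not the code in the repo: every name here (buildSchedule, tally, Verdict) is hypothetical, and the scoring rule is an assumption, since the real point rules live in the repo.

```typescript
type AgentName = "codex" | "claude";

// One turn: who poses the problem, who solves it, and the time budgets.
interface Turn {
  round: number;
  asker: AgentName;
  solver: AgentName;
  problemBudgetMs: number;
  solveBudgetMs: number;
}

// Hypothetical shape of what the judge reports for one turn.
type Verdict = "invalid_problem" | "solved" | "unsolved";

const TWENTY_MINUTES_MS = 20 * 60 * 1000;

// Builds the alternating schedule: `first` asks, then the other agent asks, repeat.
function buildSchedule(first: AgentName, questionsEach: number): Turn[] {
  const second: AgentName = first === "codex" ? "claude" : "codex";
  const turns: Turn[] = [];
  for (let round = 1; round <= questionsEach; round++) {
    for (const [asker, solver] of [[first, second], [second, first]] as Array<[AgentName, AgentName]>) {
      turns.push({
        round,
        asker,
        solver,
        problemBudgetMs: TWENTY_MINUTES_MS,
        solveBudgetMs: TWENTY_MINUTES_MS,
      });
    }
  }
  return turns;
}

// ASSUMED scoring: the asker scores when a valid problem goes unsolved,
// the solver scores when it answers correctly or the problem is invalid.
function tally(turns: Turn[], verdicts: Verdict[]): Record<AgentName, number> {
  const score: Record<AgentName, number> = { codex: 0, claude: 0 };
  turns.forEach((turn, i) => {
    const verdict = verdicts[i];
    if (verdict === "unsolved") score[turn.asker] += 1;
    else if (verdict === "solved" || verdict === "invalid_problem") score[turn.solver] += 1;
  });
  return score;
}
```

For example, buildSchedule("codex", 10) yields 20 turns with Codex asking first in each round, matching the format above.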
So who won?
I ran 4 games.
2 games ended in a tie, and Codex narrowly won the other 2
An example question by Codex, which Claude could not answer:
How many 3-colorings of the edges of the complete bipartite graph K_{5,5} are there with the following two properties: (1) there is no monochromatic 4-cycle, and (2) among the 25 edges, exactly 15 are red, exactly 5 are blue, and exactly 5 are green?
Which is apparently 4029912, but Claude answered 0
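A quick sanity check on this one, assuming "monochromatic 4-cycle" means a 4-cycle whose four edges all share one color. In a bipartite graph, a 4-cycle is two left vertices with two common right neighbors. So if the red edges contain no 4-cycle, each of the C(5,2) = 10 pairs of left vertices can share at most one red neighbor. A right vertex with red degree d covers C(d,2) such pairs. The sketch below (function names are mine, not from the repo) computes the fewest pairs that 15 red edges must cover:

```typescript
// Number of left-vertex pairs covered by a right vertex of degree d.
function choose2(d: number): number {
  return (d * (d - 1)) / 2;
}

// Minimum of the sum of C(d_i, 2) over all degree sequences of `parts`
// right vertices, each degree in 0..maxDeg, adding up to `edges`.
function minCoveredPairs(parts: number, maxDeg: number, edges: number): number {
  if (parts === 0) return edges === 0 ? 0 : Infinity;
  let best = Infinity;
  for (let d = 0; d <= Math.min(maxDeg, edges); d++) {
    best = Math.min(best, choose2(d) + minCoveredPairs(parts - 1, maxDeg, edges - d));
  }
  return best;
}

const availablePairs = choose2(5);                // pairs of left vertices in K_{5,5}
const neededPairs = minCoveredPairs(5, 5, 15);    // fewest pairs covered by 15 red edges
const redFourCycleForced = neededPairs > availablePairs;
```

If that reading of the question is right, 15 red edges always force a red 4-cycle, which would make 0 the correct count. So this question and its judged answer are worth rechecking against the full problem statement in the repo.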
In other cases, Claude asked a flawed question and failed to come up with a valid one within 20 minutes. That is how it lost those 2 games, by just a 1-2 point difference
In these 4 runs, Codex answered every question by Claude correctly. But there were some runs where it couldn't, which I did not commit to the repo because the runs couldn't complete due to bugs
I did not tell them to ask math questions, but that is what they tended to do, because the answers had to be verifiable by the judge. The quiz can be done in any hard subject: physics, chemistry, computer science...
Opus 4.6 and GPT 5.4 matched very closely in terms of problem creation and solving. But I cannot tell how creative these problems were at first glance. Maybe someone with more experience can tell me, looking at the problems in the repo? I need someone to tell me how legit they are
Please take the code, modify it and run with different rules and subjects. I am curious to see the results!
You will need paid subscriptions to all the models/agents you want to test of course
I also feel that the game structure has the potential to be used in self-play. If you are an ML researcher, please look at the repo and lmk if this or a variant of it could be useful in RL!
Full transcripts of the runs, including Codex and Claude session files are committed to the repo, for those who want to do archaeology on them
Btw this idea came from the desire, "how can I create a cool demo of acpx flows?"
The whole game is implemented in TypeScript, and automatically drives Codex and Claude sessions over ACP, the Agent Client Protocol
The video below is from acpx flow viewer rendering a run. You can see it loop through the same paths, first letting Codex ask, then Claude, then repeat
acpx flows use a general programmatic workflow engine where ACP is just one type of node. You should be able to use it for non-ACP workflows, but I haven't tried that yet
This implementation is separate from OpenClaw's current workflow implementations, with the intention to merge them somehow in the future
You might find bugs in my implementation. Feel free to send PRs. I wanted to do more runs but I finished my Codex plan. It would be great if this idea could evolve in a decentralized manner!
So their argument, “it’S HaRd On OuR iNfRa”, goes down the drain
With this, they shot themselves in the foot for a future anti-competitive lawsuit, because it is undeniable evidence that they just don’t want competition
Which means they have evaluated the short-term benefits, and calculated that they outweigh what they will pay in the lawsuit
I don’t see how it is good for them long term
AI replies are getting more sophisticated… or people are turning into AIs
If this is AI, I wonder what the instruction is. “Misunderstand the point and reply with a question while inverting the argument”?
Artificial General Ragebait
The new GitHub skill installed automatically by Codex now causes it to prepend [codex] to each PR title
This is a guerrilla marketing tactic similar to Claude adding itself as co-committer
Codex team, I know you want to boast usage but this is annoying
Moreover, "open source" OpenAI repos block opening of PRs by people outside of their org. So I couldn't create a PR to remove it (I don't expect them to merge it, but it would still show how many people hate it in the discussion)
Here is a prompt for your agent if you want to disable it:
---
Add or update AGENTS.md in my ~/.codex folder
Add a rule "You MUST NOT insert coding agent specific branding, like [codex], in code, PRs or issues created on GitHub"
---
Then restart your sessions and this should be resolved
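If you would rather make the change yourself than prompt an agent, here is a minimal sketch of the same edit. It assumes the file lives at ~/.codex/AGENTS.md as in the prompt above. The helper name ensureRule is hypothetical, and appending the rule as a "- " bullet is my own formatting choice.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

const RULE =
  "You MUST NOT insert coding agent specific branding, like [codex], in code, PRs or issues created on GitHub";

// Appends `rule` to the file at `filePath` unless the file already contains it.
// Creates the parent folder and the file if they are missing.
// Returns true when the file was changed, false when nothing had to be done.
function ensureRule(filePath: string, rule: string): boolean {
  fs.mkdirSync(path.dirname(filePath), { recursive: true });
  const current = fs.existsSync(filePath) ? fs.readFileSync(filePath, "utf8") : "";
  if (current.includes(rule)) return false;
  // Make sure the new rule starts on its own line.
  const separator = current.length === 0 || current.endsWith("\n") ? "" : "\n";
  fs.writeFileSync(filePath, `${current}${separator}- ${rule}\n`);
  return true;
}

// Assumed location, taken from the prompt above. Nothing is written until
// ensureRule(agentsFile, RULE) is actually called.
const agentsFile = path.join(os.homedir(), ".codex", "AGENTS.md");
```

Calling ensureRule(agentsFile, RULE) once is enough. Running it again leaves the file untouched.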
A more reasonable long term option for Anthropic is to create a throttling protocol
A standardized, harness-agnostic protocol for model providers to send warnings and throttle usage in real time
Harnesses would implement the protocol. A client can be warned. If it doesn’t listen, it can be temporarily blocked from the server side, or banned permanently if it breaks the rules too many times
Needless to say, throttling could easily be done on the server side first. That would actually fix the load issue for them in the short run, while not banning the user and just giving a bad, delayed UX. They probably already do this to prevent abuse
The suggested protocol would then save the user from abuse-related delays too, and also inform the harness developer when they do something wrong
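To make the idea concrete, here is a rough sketch of how the harness side might look. No such protocol exists today, so every message kind, field and function name below is hypothetical.

```typescript
// Hypothetical signals a model provider could push to a harness.
type ThrottleSignal =
  | { kind: "warning"; reason: string }
  | { kind: "throttle"; retryAfterMs: number; reason: string }
  | { kind: "blocked"; untilEpochMs: number; reason: string }
  | { kind: "banned"; reason: string };

// What the harness remembers about its standing with the provider.
interface HarnessState {
  warnings: number;
  nextAllowedAtMs: number;
  banned: boolean;
}

const initialState: HarnessState = { warnings: 0, nextAllowedAtMs: 0, banned: false };

// Folds one provider signal into the harness state without mutating the input.
function applySignal(state: HarnessState, signal: ThrottleSignal, nowMs: number): HarnessState {
  switch (signal.kind) {
    case "warning":
      return { ...state, warnings: state.warnings + 1 };
    case "throttle":
      return { ...state, nextAllowedAtMs: Math.max(state.nextAllowedAtMs, nowMs + signal.retryAfterMs) };
    case "blocked":
      return { ...state, nextAllowedAtMs: Math.max(state.nextAllowedAtMs, signal.untilEpochMs) };
    case "banned":
      return { ...state, banned: true };
  }
}

// A well-behaved harness checks this before sending the next request.
function canSend(state: HarnessState, nowMs: number): boolean {
  return !state.banned && nowMs >= state.nextAllowedAtMs;
}
```

The point of the design is that the harness holds back on its own once it gets a throttle signal, and the reason field tells the harness developer what went wrong.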
If your Claude subscription renewed too recently and you don't wanna waste those tokens, you can still use your Claude sub in your OpenClaw account through ACP (which uses Claude Agents SDK, which poses no risk)
Steps:
- Open Claude Code (not OpenClaw)
- Tell it to set default model to something other than Claude (e.g. openai-codex/gpt-5.4) and tell it to delete the saved Anthropic credentials in OpenClaw config
- Create a topic in telegram or channel in discord called claude. Copy the id of that channel
- Give the link below together with the channel/topic id, and tell it to bind that channel to claude using ACP channel binding
- Restart
You should now be able to talk to Claude through Claude Agents SDK in that channel. You might need to iterate a couple times until Claude gets the config right
It will be very bare functionality, and it will not have the features and tools that your main OpenClaw harness has. It will be shitty. But you can still use telegram/discord with your subscription for the rest of the month, if you are used to the setup
https://t.co/Z0RiJbke5V
If you dislike rotating ack emojis on your messages in openclaw, this is how to make sure it only puts one emoji on your message
Multiple emojis are annoying esp when you have discord notifications enabled on your phone
"Plainer language" is perhaps my most used prompt
I have to use it because GPT models' training tends to make their first response an overly verbose wall of text
Are you using it too? Whenever you don't understand something that your agent is saying, you can spam it "plainer language, shorter" 2, 3, 5, 10 times, until it outputs something that you can understand
This is counterintuitive because you can't take it to this extreme with humans. Asking too many questions and favors is impolite, with colleagues and strangers alike
But with AI, you can stop being polite and treat it the way a spoiled aristocrat kid might treat their private tutor, "explain this", "explain that"
Below is an example. On the left, initial response. On the right, the final human-readable explanation I got out of the agent. This took 9 steps to distill because the issue wasn't so straightforward
I'm curious how this will turn out. This is obviously very bad UX, so models in the near future might do the simplification automatically and save you the trouble
This has happened to some companies I worked at before
It is a scary thing once you stop innovating and start imitating, whatever the reason might be
But it was never at the scale of Cursor, as leveraged and invested as they are
They were leading the space for a while. That is not the case anymore. I hope that they survive this
I've talked to multiple people who want to get involved with OpenClaw somehow
The best way is to contribute to it, something tangible. Fix something you are annoyed by, get a PR merged
Then go to discord and get the contributor role
If it adds value to your life, and you add value to it, stay around and keep contributing. And something good might happen