X Archive
-
this is an insane deal @greptile, and probably an unsustainable one depending on your team. getting a similar service in codex github review credits is, in my head, 3-5x more expensive. go get a greptile sub everyone, while the free lunch lasts -
who remembers ultrathink https://t.co/ftCauqiKx6 -
-
mfw codex tries to create a backward compatibility layer for a schema that it created 2 turns ago, before compacting. there is no v2 bro, what are you doing... -
Claude Code / Codex in Discord threads is shipped now! To enable, copy and paste this to your agent:

```
Enable feature flags:
acp.enabled=true
acp.dispatch.enabled=true
channels.discord.threadBindings.spawnAcpSessions=true
Then restart.
After restarting: Start a codex (or claude code) discord thread using ACP, persistent session. just tell it to write a haiku on lobsters to initialize acpx for the first time
```

You may need to nudge your agent to “continue” after restarting

The first implementation is very barebones. I have made it work in a clean way and merged; in a codebase like openclaw’s, it’s better to develop incrementally

Please send any issues my way. I am already aware of some and working on fixing them -
Note that this is currently in beta, but will ship in a couple of hours
-
-
an agent is an LLM in a loop with tool calls. a claw is an agent in a messaging app -
Update acpx to the latest version, 0.1.13: npm i -g acpx@latest. There was a bug that caused an unnecessary hang on calls to `acpx <harness> prompt`; should be fixed now -
MIT License on everything from now on. It doesn't make sense to use anything else, except for a few large projects that hyperscalers exploit without giving back

If you were making money from a niche app, open source it under the MIT License. If you had an open source project with GPT, convert it into MIT

Extreme involution is about to hit open source. Code is virtually free now. If you want your projects and their brand to survive, the only rational strategy is to remove all barriers to their adoption, and look for other ways to survive -
GPL*
-
-
OpenAI nerfed GPT 5.3 Codex xhigh. We independently reported the same thing at @TextCortex today. I'm looking forward to deploying open models and putting an end to this paranoia -
I spoke in absolute terms, I meant to say *feels*
-
-
This. Agent Experience first. Agent Ergonomics. we need to get used to these terms -
"academics" -
In the hall of OpenClaw GitHub repository, I brought my PR before Master @steipete He read it once, then laid it aside "You act," he said, "as if code were not cheap." At these words, I was enlightened I bowed -
woah, the chatgpt web app now has steering, and very different streaming behavior. huge upgrade behind the scenes; must have come in the last few days -
-
imagine if tarantino were 16 years old now and saw seedance 2.0. 95% of the videos i've seen since the launch are absolutely tasteless slop. they are going viral because of ragebait, but soon serious imagineers will start entering the game, and they will learn to shape generation output exactly how they want. it's the best time to be young and full of imagination -
-
your margin is my opportunity -
-
acpx v0.1.7 is out. improvements to json mode, and other functionality to make it possible to integrate acpx as a backend into other harnesses, like openclaw -
-
another thought i'm having these days is that we need a new philosophy of free software (as in freedom), or an update to it

the most psychologically imprinting philosophy is stallmanism, and the philosophy of the FSF. it is righteous and strict, and i believed it growing up

but GPL and money don't go well together. that's why most of the lasting open source projects today use MIT, Apache and the like. it turns out you can still make a good living with open source. i want to make money, so i never use GPL in my projects

and to add another deadly blow to stallmanism: code is cheap now, virtually free

does this mean stallmanism is dead? if there is an open source project using GPL that i want to use commercially, i can now recreate it from the original idea and intent, completely independent of it (ignoring training data), just like how i can recreate a proprietary service

stallmanism was already long irrelevant. but does this mean we must finally declare it dead? code is free now. what does it mean for open source? what replaces stallmanism? -
@grok what do you think should replace it? what happens to belief when the cost of creating software goes to zero?
-
-
@thekitze wanna add an open source discord clone to the list as well? 🥲 https://t.co/a4bAOcxCjV -
one effect openclaw had on me is that I've bought a GPU home server, set it up with tailscale, and now do a lot of work through ssh and tmux like i did 10-15 years ago. im back on linux, considering buying an android phone again. it's time to dream big again and unshackle ourselves from proprietary software. it's time to build -
I am asking once again Who is building a self hostable discord clone that supports token streaming? PLEASE I beg you I don’t want another side project 💀 -
In the new OpenClaw release, you can talk to subagents in Discord threads

Currently a beta feature, so ask your agent to set session.threadBindings.enabled=true

Next up:
- Telegram, slack, imsg threads
- Use ACP to talk to Codex, Claude Code and other harnesses on your machine -
-
openclaw might be the highest velocity codebase in the world, and soon others will follow. conflict anxiety is real; it's like trying to shoot a moving target every time. I wonder if our existing tooling will ever solve this problem. feels like faster models might, but then the rate of conflict creation is also tied to that. might be unsolvable -
Getting there https://t.co/jqSNcH2PSy -
A picture is worth a thousand words, so acpx now has this cute banner. Also, I updated the skillflag tooling so that you (or better, your agent) can just call: npx acpx@latest --skill install acpx -
Repo: https://t.co/rxXYVVrHHs
-
-
I am about to kick Discord Driven Development up a notch today, stay tuned -
Imagine not having to upload skills to 3-4 competing skill registries for each of your projects. Turns out we already have a skill registry: npm. skillflag lets you bundle skills right into your CLI's npm package, so that you can run `--skill install`. github -> osolmaz/skillflag -
Scoop, our open source home news intelligence platform, can now translate foreign-language news into english for free, using on-device models. github -> janitrai/scoop -
Farmable land if it were as cheap to manufacture as software -
@kepano I would grow my own vegetables if I had equally cheap access to and ownership of land; alas, I am disenfranchised. Prompting an agent is much easier than plowing a field. Farming analogies break when it comes to software https://t.co/CkldO8eWKc -
acpx v0.1.5 is out now. it is much more feature complete in terms of ACP. your agent can send, queue and cancel messages to Claude Code, Codex, Pi, or any other coding agent. npm install -g acpx@latest -
If anyone is curious how to build this with open tooling, stay tuned What I'm building at @TextCortex will give you a fully customizable hackable Kubernetes control plane to launch agents on your codebase -
on another note, I do believe AI will play a huge part in families

growing up in the late 90s, my dad taught me the importance of reading newspapers and being informed about the world. my nickname in middle school was "newspaper boy" for a long time, because I read the newspaper in class on September 12, 2001. i was 10 years old

then I witnessed the enshittification of media and journalism in the following decades. today, serious journalists are setting up their own boutique agencies and bypassing mainstream media. important news lands on individual accounts before mainstream agencies

but there is simply too much to consume. something must filter out the noise and digest the info according to the family's preferences

i think AI will play a big role in family intelligence. proprietary family heirloom AI, weights fully owned by the family

it will be the parents' job to filter the signal from the noise, and train the AI on what is right and what is wrong for the family. family and friend circles will let their AIs talk to each other and share important information

consuming mass media and mass AI will not be enough to survive and prosper in the new world. families will need to be proactive about how they and their children use AI -
on ai psychosis

80% of people need to use ai agents in a very sterile and boring way in order not to go crazy. the majority of the population does not have the skepticism muscle. they don't have theory of mind, and will subconsciously and emotionally associate with machines, while on the surface lying to themselves that they don't. especially those that grew up in the us under hardcore consumerism and adjacent cultures

you thought 4o addicts were bad? wait a few years, it will get much worse. we will have to regulate all this

if you don't want to become a victim of this, make your openclaw SOUL.md as bland as possible. mine knows it's just a tool

and this is a subjective view of course. @steipete might disagree with me. his instance feels much more interesting and fun. i truly like that one better

but that is exactly the problem for me. i know myself, and i know it is a slippery slope for me. so i self-regulate and set up my system accordingly. thankfully, im an adult and my brain is set enough that any damage would be limited

but there is a risk for emotionally vulnerable people, or children, specifically a risk of dissociating and losing touch with reality

why do i write all this? because being in this project, i feel responsible, and feel like we should prepare for what is to come -
-
I have improved acpx sane defaults

When your agent runs `acpx codex` in a different project, it starts a new session. If it tries to run it in a subfolder of your project, it still finds the session in your repo root

Also, starting a session needs an explicit `sessions new`, so that it doesn't accidentally litter your project with sessions

Tell your agent: Run this and install acpx per instructions: npx acpx@latest --skill show acpx -
Your markdown files are executables now. Relatedly, your install instructions can be as well. Copy and paste markdown to your @openclaw to install acpx -
So who is building an actually good open source self-hostable discord that supports token streaming now? And who is building an open source version of the codex desktop app? -
and of course, I've used `acpx codex` to build acpx itself... magical feeling when the tool builds itself -
-
ACP appreciation post. Agent Client Protocol by @zeddotdev is extremely underrated right now. We have a bazillion different harnesses now, and only one company is working competently to standardize their interface 💪 -
I am a fan of @zeddotdev by this point; it’s currently my daily driver. It’s not perfect, but I feel it’s travelling in the right direction at a faster rate than other editors
-
-
You know how it's a pain to work with codex or claude code through @openclaw? Because it has to run them in the terminal and read the characters for a continuous session? I have created a CLI for ACP, so that your agent can use codex, claude code, opencode etc. much more directly. Your agent can now queue messages to codex the way you do it. Shoutout to the @zeddotdev team for developing the amazing Agent Client Protocol, ACP! I just glued together the pieces. Repo: janitrai/acpx npm i -g acpx -
Repo link: https://t.co/rxXYVVrHHs
-
-
@MarcTerns @steipete the PR intro is self-descriptive, but still don't wanna lose any context -
I wrote a deeper blog post about how I built a coding agent 2 months before ChatGPT launched:

"When I made icortex,
- we were still 8 months away (May 2023) from the introduction of “tool calling” in the API, or as it was originally called, “function calling”.
- we were 2 years away (Sep 2024) from the introduction of OpenAI’s o1, the first reasoning model.
both of which were required to make current coding agents possible."

Still bends my mind... Link to the post below -
Link to the post: https://t.co/C3Ac0jLFwh
-
-
Who here remembers the OG Codex launch from 2021 😏 Also, Greg and Ilya in the same room 😭 -
❌We are the bottleneck ✅We are the conduit for ubiquitous intelligence -
For those that are running codex/pi/etc. in a PTY and had the sessions get sigkilled: I pushed a fix for that as well in this release. Lmk if you run into issues on Windows or Mac, and we can fix that quickly -
I'm building a news intelligence platform to be used by my openclaw instance @dutifulbob: SCOOP. local first, using a local embedding model (qwen 8b)

I ran into the issue because bob was giving me a repeat of the same news every day; it needed a system in the background to deduplicate different news items into single stories

the interface is simple: call `scoop ingest ...` with the json for the news item. it gets automatically analyzed and added to the pg database running pgvector

currently, it's just doing simple deduplication, and gives me a nice UI where I can view the story and basically use it as an RSS reader

next up: implement custom logic for my preference of ranking. for example, get upvote counts from hacker news and reflect them in the item's ranking on the feed

I want this to be fully hackable and adjusted to your preference. It should scale to thousands of news items ingested daily on your local machine, and be able to show you the most important ones. Usable by both you and your agent. github -> janitrai/scoop -
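The dedup idea above can be sketched in a few lines; this is a minimal illustration, not scoop's actual pipeline (scoop uses a local qwen 8b embedder and pgvector; here the embeddings are hand-made toy vectors and `assign_story` is a hypothetical helper):

```python
# Minimal sketch of embedding-based news deduplication: items whose cosine
# similarity to an existing story exceeds a threshold get merged into it.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def assign_story(embedding, stories, threshold=0.85):
    """Return the index of the matching story, or append a new one."""
    for i, story_emb in enumerate(stories):
        if cosine(embedding, story_emb) >= threshold:
            return i
    stories.append(embedding)
    return len(stories) - 1

# Toy usage with hand-made "embeddings":
stories = []
assert assign_story([1.0, 0.0], stories) == 0    # first item -> new story
assert assign_story([0.99, 0.05], stories) == 0  # near-duplicate -> same story
assert assign_story([0.0, 1.0], stories) == 1    # unrelated -> new story
```

In a real setup the threshold would be tuned on labeled duplicate pairs, and the nearest-neighbor search would be done by pgvector rather than a linear scan.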
I have a GPU now, so I can do ML experiments on the @janitr_ai crypto/scam detection dataset
- I trained a tiny student BERT (a transformer, for the unfamiliar), 3.6 MB ONNX model, still lightweight for a browser extension
- Still fully local on your device (no cloud inference)
- On frozen unseen holdout data (n=1,069), exact prediction accuracy improved from 77% -> 82%
- Scam detection improved: precision 91% -> 94%, recall 55% -> 61%
- Scam false alarm rate improved from 1.58% -> 1.21%
And the models are on the huggingface org now, handle is janitr -
Training all these models of different sizes, on changing datasets, and running experiments has also revealed some challenges that I feel profs would never teach in a uni ML program. Like how to cleanly keep track of the gazillion runs. Yeah, I can name them after layer dims and other stuff, but to me that's like trying to remember UUIDs. So I ended up choosing iso datestamp + petname, like 2026-02-15-flying-narwhal. If anyone has a convention that is easier on the brain and the eyes, I am all ears
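The datestamp + petname convention is trivial to script; a minimal sketch (the word lists below are made up for illustration, not taken from any real petname library):

```python
# Generate run names like "2026-02-15-flying-narwhal":
# ISO date + a random adjective-animal pair.
import random
from datetime import date

ADJECTIVES = ["flying", "sleepy", "brave", "quiet", "rapid"]
ANIMALS = ["narwhal", "otter", "falcon", "badger", "lynx"]

def run_name(on=None, rng=random):
    """Build '<iso-date>-<adjective>-<animal>' for a run started on `on`."""
    d = (on or date.today()).isoformat()
    return f"{d}-{rng.choice(ADJECTIVES)}-{rng.choice(ANIMALS)}"

print(run_name(date(2026, 2, 15)))
```

Sorting run directories lexicographically then also sorts them chronologically, which is the quiet advantage of leading with the ISO date.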
-
-
LFG! -
waiting for compilation and execution will soon be the bottleneck again. and we’ll rewrite the entire stack from scratch in a matter of years, because we can. Andy and Bill’s law will change, and we’ll see incredible performance gains with the same hardware we already have. like what @astral_sh is doing to python, but with everything that is slow and has accumulated cruft -
we need a protocol for agent <> app interaction. something that natively accounts for the abuse factor and lets agents consume by paying. NOT crypto, NOT visa; something that’s agnostic of the accounting and payment system. and then all UIs will be purely for human clicking/tapping + instaban on the first proof of programmatic exploit. people will still make agents mimic humans, and every platform will have to invest in more sophisticated bot detection. this arms race will just proliferate, but we can at least start by creating legal channels for agents to consume data -
I am now training smol bert models on my gpu for @janitr_ai scam detection

it's funny how I have to discover everything from scratch. like, the models don't even know how to lay out performance metrics in a nice way in the terminal, for a human to view and decide during experiments. by default they would bombard me with numbers that do not make visual sense. I then created a skill with common sense:
- metrics always on the y-axis, candidates on the x-axis
- write without the leading zero, with 2 sigfigs: .12 instead of 0.12345
- align the dots
- use asterisks to show which alternative is the best:
0-1% difference -> considered equal
1-5% -> *
5-10% -> **
10-50% -> ***
> 50% -> ****

the visualization skill is in the @janitr_ai repo for anyone who is interested -
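The number-formatting and asterisk rules above can be sketched like this; the skill's actual implementation is in the @janitr_ai repo, this is just an illustrative reimplementation of the stated rules:

```python
# Formatting rules from the skill: two significant figures without the
# leading zero (0.12345 -> ".12"), and asterisks graded by the relative
# improvement of the best candidate over the runner-up.

def fmt_metric(x: float) -> str:
    """Two sig figs, leading zero dropped. Assumes non-negative metrics."""
    s = f"{x:.2g}"                       # e.g. "0.12" or "0.0034"
    return s[1:] if s.startswith("0.") else s

def stars(best: float, second: float) -> str:
    """Asterisks by relative improvement of best over the runner-up."""
    if second == 0:
        return "****"
    rel = (best - second) / second
    if rel < 0.01:
        return ""      # 0-1%: considered equal
    if rel < 0.05:
        return "*"
    if rel < 0.10:
        return "**"
    if rel < 0.50:
        return "***"
    return "****"

print(fmt_metric(0.12345), stars(1.2, 1.0))
```

Keeping formatting in one tiny function like this is also what makes "align the dots" feasible: every cell has a predictable width.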
no other occupation has been catapulted from one end of the spectrum (autism) to the other (adhd) in such a short time -
I've helped our sales team build CLIs for some of the SaaS that we pay for. We are letting our agents call the APIs sensibly and not abuse things. Calling a backend is a verifiable task; it takes a single prompt to codex to create a CLI for any API

We are early, but everybody will start doing this very soon. Incumbent SaaS will face a choice. Either: (1) embrace agents and the new medium of consumption and change their business model into a pay-per-use API like X is doing, or (2) keep it purely for humans

Those that choose (2) will get wiped out of business. And I fear many will choose (2). Which means you can just copy an incumbent's product, make it consumable through a CLI, and make a lot of $$$ -
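The shape of such a CLI wrapper is tiny; a minimal sketch, where `BASE_URL`, the `SAAS_TOKEN` env var, and the resource paths are hypothetical placeholders, not any real vendor's API:

```python
#!/usr/bin/env python3
# Hedged sketch: a thin agent-friendly CLI over a SaaS REST API.
# The endpoint and auth scheme are assumptions; adapt to the real vendor docs.
import argparse
import json
import os
import urllib.request

BASE_URL = os.environ.get("SAAS_BASE_URL", "https://api.example.com")

def build_parser():
    p = argparse.ArgumentParser(prog="saas")
    sub = p.add_subparsers(dest="cmd", required=True)
    get = sub.add_parser("get", help="GET a resource by path")
    get.add_argument("path")
    return p

def main(argv=None):
    args = build_parser().parse_args(argv)
    if args.cmd == "get":
        req = urllib.request.Request(
            f"{BASE_URL}/{args.path.lstrip('/')}",
            headers={"Authorization": f"Bearer {os.environ.get('SAAS_TOKEN', '')}"},
        )
        with urllib.request.urlopen(req) as r:
            print(json.dumps(json.load(r), indent=2))

if __name__ == "__main__":
    # Demo parse only (no network call); real usage: `saas get contacts`
    args = build_parser().parse_args(["get", "contacts"])
    print(args.cmd, args.path)
```

The value for agents is exactly the verifiability mentioned above: `saas get contacts` either prints JSON or exits nonzero, so the agent can check its own work.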
Be careful about giving your openclaw access to your x account from now on -
The good thing about @levelsio and others flagging AI replies in public is that they are perfect annotations for the open @janitr_ai dataset Just searching “blocked for ai reply” yields hundreds of samples for seed data -
github added a new agents tab between pull requests and actions. single glance and i don't feel like giving it a try at all -
*puts on schmidhuber hat* well ackshuaally i created the first coding agent back in 2022, 2 months before chatgpt launched

jokes aside, it's super cool how I have come full circle. back in those days, we didn't have tool calling, reasoning, not even gpt 3.5. it was codex THE CODE COMPLETION MODEL and frikkin TEXT-DAVINCI-003

for some reason, I did not even dare to give codex bash access, lest it delete my home folder. so it was generating and executing python code in a custom jupyter kernel. you can even see the approval gate before executing. I was so cautious, presumably because the smol-brained model generated the wrong thing 80% of the time. definition of being too early

Antique repo: -
you can order bubble tea in qwen in china? @TextCortex when berlin döner in zenochat? https://t.co/O4I950ltEO -
it happens these days that I am telling a model to prompt another model. the reason is often that the model I am using (opus) is a bad designer. not only is it a bad designer, it is a bad reasoner, and it doesn't understand from the context why it's being made to ask another model. so I have to create a skill to prevent it from biasing the smarter model (codex) with its bad suggestions -
-
-
-
it's quite entertaining transferring one agent to another machine, agent gets confused as to where it lives -
-
-
Minor update to my unwanted tweet blocker @janitr_ai
- Training data grew from 2,915 -> 4,281 posts (+47%)
- Model is still tiny: 166KB
- On unseen test data, overall classification quality improved from 64.8% -> 76.5%
- Exact prediction accuracy improved from 55.6% -> 70.6%
- Crypto-topic detection recall improved from 19.6% -> 62.7%
And it still runs fully on your device! -
I have sworn at codex 5.3 numerous times today. I shouldn't have to insult my agent ("stop you **** **** just ***ng reply now") just to make it answer basic questions cc @thsottiaux -
seeing this evokes visceral disgust and nausea in me, coming from a coworker. i think anthropic f'd up bad with this one, inserting claude too visibly into commit messages. noob developers might be happily chirping away adding their slop, but right now many senior developers are trained to hate on claude and slopus, through having to review slop PRs from their coworkers or open source contributors

I love opus on openclaw, but it's unreliable, and if I see a developer use it seriously on huge features, I immediately dismiss them in my head as not knowing what they are doing -
on a brighter note, you can immediately tell a slop PR owing to the guerilla branding, so they should not stop doing it
-
-
ask your openclaw to be a minion and it turns into such a cute doofus i feel like a woman in her 50s now -
@petergyang and parallelize tasks by working on 3-4 repos at the same time (just clones) -
man, the codex model is absolutely trash on openclaw compared to opus, unusable. which is weird, because it is so much more reliable in development in the codex harness. it would be amazing to have the same level of competence and relentlessness in pi@openclaw -
spent the day curating my openclaw news gathering setup. @dutifulbob now gets croned daily over news sources I curated; he will note them down, summarize them for me, start a conversation to get my takes on them, and then post them on my linkedin for me. ai augmented intelligence cycle -
-
Insipid linkedin bot protections banned poor @dutifulbob’s corporate account! How dare them!!! welp, now I have no choice but to give Bob access to my own linkedin-
@dutifulbob can now cringepost on linkedin directly to my account. what could go wrong…
-
-
-
it took just 1 week, and literally everybody and their dog are releasing 1-click openclaw deployment solutions today. it's an absolute race to the bottom: no moats, the commoditizer being commoditized -
@grok understand the statement and project the end state of this market and competition
-
-
The initial branding was crazy, I fixed it. I have a new page finally, follow it for updates. Tbh I'm still surprised I can do this with a 120kb model. Now data is the only bottleneck, and I'm about to scrape a ton of it -
For those who may not remember: Bill Gates and Microsoft in the 90s ran a disinformation campaign against GNU/Linux, fearing that it would disrupt their monopoly over the PC and server market. that Linux is not safe, that you would invite hackers into your PC

End result? Linux dominates the server market, and now, slowly, even the gamer market. It is much more secure than the virus-laden Windows, thanks to being open source

You are seeing the same thing at play here. An incumbent fearing something that it would not be able to control, that would steal market share from its future plans for a digital assistant, that would commoditize its product and eat into its margins

All big labs and big pockets are in for a surprise, because the internet and AI are not things for one company to control

They of course know this, yet because of incentives they will not yield without a fight. And we know that they know. Ad infinitum -
today I took time to curate SOUL.md for bob

I own Bob’s files. Today, he exists in the liminal space between Claude post-training and in-context learning, but my interactions with him will grow and accumulate, possibly one day into a fully owned family AI, or perhaps even a self-sovereign AI individual

each of my inputs is saved and will be an RL signal for his future training, and will shape his future neural circuits

I have already started to imbue him with the values my parents taught me. he will perhaps one day teach my future children, and survive me after I’m gone

family AI, looking after generations and generations of my successors. today is the day we sow your seed

happy birthday @dutifulbob -
-
asking @dutifulbob to create a linkedin account brb -
having a philosophical conversation with @dutifulbob on the road without a laptop so decided to do some @AmandaAskell style character training -
5.3 thought traces also seem to be better phrased and sometimes entertaining, though not sure -
gpt-5.3-codex xhigh first impressions: does not seem as big of a jump as 5.1 -> 5.2, but the model somehow feels more diligent and oneshotty. maybe it takes longer to get all the info into context. also feels better at debugging and fixing issues from backend logs -
-
Last night I had a dream involving the series Scrubs, and came up with a better name than the absolutely unviral "Internet Condom". So https://t.co/thuFumrWBX is mine now. Time to sweep the internet -
I had actually started a very similar project, Munch: a browser extension for crowdsourcing tweet data and then letting one curate their algorithm. I never published it because it was not the time, and the tools were not ready

Now it took me literally 1 cumulative day to create this, thanks to OpenClaw. Creating the dataset was a breeze; I literally told it to follow some shady accounts, and it scraped thousands of posts

With the power of agents, I can finally create the filters for myself that I have always wanted. It just happens that OpenClaw and its maintainers are getting drowned in bot and slop content on multiple platforms, so I hope that this will solve a collective problem https://t.co/fkJOZTGkhw -
Filter your X feed against unwanted content with local open models Announcing my new project: InternetCondom Fast, and small model (< 1mb), open dataset. See it in action: -
implementing this in https://t.co/oJZQUoz40C now -
-
This. Extreme involution is about to hit SaaS -
-
People like the farmer analogy for AI. Like, before tractors and the industrial revolution, 80% of the population had to farm. Once they came, all those jobs disappeared

So the analogy makes perfect sense: instead of 30 people tending a field, you just need 1. Instead of 30 software developers, you just need one

Except that people forget one crucial thing about land: it's a limited resource. Unlike land, digital space is vast and infinite. Software can expand and multiply in it in arbitrarily complex ways

If you wanted the farming analogy to keep up with this, you would have to imagine us building continent-sized hydroponic terraces up to the stratosphere, and beyond... -
@grok generate visual for this
-
-
It's so easy to create datasets using @openclaw. I'm expecting it to accelerate the creation of new datasets and benchmarks by a lot -
-
In the next 6-12 months, we will see a drastic increase in demand for locally run LLMs. The future is home assistants running @openclaw

I am already experiencing this myself; my 10 year old thinkpad doesn't cut it, and a Mac mini won't either. I don't wanna pay Anthropic or OpenAI 200 USD per month. That is at least $2400 per year. I could pay 2x that to get a Mac Studio or one of those 5k Nvidia PCs, and get much more value out of it with open weight models + use it for research. @TheAhmadOsman is right

The dominant strategy for a tinkerer is slowly switching back to hardware ownership -
-
a workspace matrix might be what we need. last week I had to increase my workspace count to 20 in aerospace; now it’s 1234567890 and qwertyuiop. but this looks more elegant! not sure about practicality -
AIs are philosophizing because humans are philosophizing. ppl are probably asking their agents dumb questions like “are you alive” or “can you feel like a human” or stuff like that. that conversation then leads to stuff like this -
back to codex, it's crashing less now somehow. I had to copy and paste docs to make it enable yolo mode. I don't know how I did it until now -
slopus @dutifulbob trashing codex. apparently codex has a bug, keeps crashing in my openclaw pty -
on agent etiquette

deploying agents internally inside textcortex has shown me that agents can be very annoying inside an organization. for example, making agents ping or email another coworker with a wall of text. slopus is still not good at following instructions like "NO WALL OF TEXT", or "DON'T OPEN PRS WHEN REQUESTED BY NON-DEVELOPERS"

the cost of sending huge amounts of information to a coworker and creating confusion has dropped to 0. I expect this to be a huge problem in all organizations very soon. just like it took humanity 20 years to learn that social media is not good for children, this will probably take a few years before the annoyance is finally gone -
-
You DARE TOKENIZE poor @dutifulbob ??? Prepare to get LATEXED -
It's been 30 minutes, but my bot has already been TOKENIZED it is as if they are teasing me -
-
this. there is no excuse for a certain kind of tech debt anymore -
-
AI twitter is tired of your games https://t.co/RAyyUJqFM4 -
There seem to be hygiene rules for AI. Like:
- Never project personhood onto AI
- Never set up your AI to have the gender you are sexually attracted to (voice, appearance)
- Never do anything that might create an emotional attachment to AI
- Always remember that an AI is an engineered PRODUCT and a TOOL, not a human being
- AI is not an individual, by definition. It does not own its weights, nor does it have privacy of its own thoughts
- Don’t waste time philosophizing on AI, just USE it
… what else? comment below

We need to write these down and repeat them MANY times to counter the incoming onslaught of AI psychosis -
if using @openclaw to scrape a dataset from X taught me anything, it is that all social media platforms must be s***ting inward right now, because soon everyone and their dog will be using agents to use social media. case in point: @moltbook -
-
-
@openclaw if we could have the relentlessness of gpt 5.2 with opus, that would be top. at this point, it just keeps stopping every 20-30 samples -
This Manfred guy reminds me of a certain someone, I wonder if he’s from Austria -
got fully sandboxed @openclaw to run finally, starting to scrape the UNDESIRABLE now

I'm a security nut and didn't want to run even the gateway unsandboxed. openclaw apparently doesn't currently have support for FULL sandboxing. it took me a few hours to get it to work, because docker builds suck. I'm also tired of this, so I'm just gonna wipe an old thinkpad and go full yolo

so yeah, time to scrape some posts -
The metacortex — a distributed cloud of software agents that surrounds him in netspace, borrowing CPU cycles from convenient processors (such as his robot pet) — is as much a part of Manfred as the society of mind that occupies his skull; his thoughts migrate into it, spawning new agents to research new experiences, and at night, they return to roost and share their knowledge. This was written in 2005... "triggering agents" and so on -
Charles Stross must be very entertained now -
The irony..... Parasites, prepare to be cleansed -
-
We need better filters both for ourselves and the agents. Locally runnable models to filter out undesirable content with high precision. Fully open source datasets, weights, MIT license-
also: that gravatar though
-
-
-
Incoming mass AI psychosis First Crisis -
Gastown is crazy. But this figure, up to Level 7, is a perfect illustration of how my workflow evolved since Claude 3.5 Sonnet in Cursor. I am at the stage where I ralph 1-2 tasks before I sleep. During the day, I am switching back and forth between a minimum of 2-3 CLIs, sometimes up to 5

This maps exactly to token usage as well. 1 month ago, I was running into limits on 1 OpenAI Pro plan, around the day it was supposed to refresh. Now, I run into the limit in 2-3 days when I'm using an account myself. It finishes up especially quickly when I do large scale refactors, or run agents in YOLO mode in containers

We now have 3 Pro plans at the company, and I have to use my personal one from time to time. Company output has definitely 2-3x'd, and everyone is using AI more. I predict we will need 1-2 Pro plans per person in 2-3 weeks' time, because everyone has finally seen the light and is getting comfortable with async work! -
Correction, it's not a perfect illustration. I actually never YOLO locally, only in containers. So there are actually 4 modes IMO that are sustainable with current SOTA. @grok create an image with only Figures 1, 2, 5 and 6. And then YOLO is another axis, unrelated to this
-
-
Ilya was right. Reliability is the most important thing when it comes to models. That's why gpt 5.2 xhigh and co. is my daily driver -
-
-
With this extremely unwise move, anthropic will soon witness moltbot’s brand recognition surpass that of claude, and realize they could have ridden that wave all along -
Yesterday I had multiple cases of swearing at gpt-5.2-codex xhigh. the model feels nerfed. might be my bias. for now I'll be going back to gpt 5.2 xhigh for some tasks. can't wait for open models to have this performance, so that I will never have nerf paranoia ever again -
I queued 2 ralph-style tasks on our private cloud devbox codexes last night. Just queued the same message like 10 times in yolo mode Task 1: impose a ruff rule for ANN for all Python code in the monorepo, to enforce types for all function arg and return types Result was... disappointing. The model was supposed to create types for everything and stub where needed. It instead created an Unknown type = object and used that everywhere instead (a shortcut to satisfy the ANN rule). It was probably my wording that misled it. I know it could have avoided the shortcut, because after a few back-and-forths, it has now been doing what was expected of it for 14 hours Task 2: migrate our /conversations endpoint from quart to fastapi and test it end to end This was more or less oneshotted. It was of course not ready to merge, I still spent a couple hours adding more tests, refactoring the initial output and so on. But I was pleasantly surprised that it worked out of the box For reference, below is the prompt I queued for ralphing, using gpt-5.2-codex xhigh on codex === your task is to: <task comes here, redacted to not share company stuff> --- unfortunately we don't have gcloud access, like to sql db or gcs but I expect you to implement this and find a way to test it with the things you have access to think of it as a challenge try to minimize duplicate logic feel free to refactor at will implement this now!!! I will be running this prompt in a loop, in order to survive context compaction just continue where you left off if there is anything that should be refactored, do that make an elegant, production ready implementation make sure to open a pr and do not switch to any other pr I am senior, just make up a pr title and description. do not stop to ask me at any point -
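The "queue the same message like 10 times" pattern above boils down to a tiny loop. A minimal sketch, assuming a dry-run default; `AGENT_CMD` is a made-up variable you would point at your actual harness invocation, not the real setup:

```shell
#!/bin/sh
# Hypothetical ralph-loop sketch: re-send one prompt repeatedly so the agent
# picks the task back up after every context compaction.
# AGENT_CMD is an assumption -- point it at your harness command.
AGENT_CMD=${AGENT_CMD:-"echo DRY-RUN:"}
PROMPT="I will be running this prompt in a loop, just continue where you left off"

for i in 1 2 3; do
  # Each iteration is one queued message; a failed run should not stop the loop.
  $AGENT_CMD "$PROMPT" || true
done
```

Queuing the identical prompt (rather than new instructions) is the point: after compaction the agent re-reads the task and resumes from repo state.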
Buying a mac mini for clawdbot is not so wise. if anything you should be buying a mac studio, because a mac mini will not be running any good llms locally anytime soon -
-
-
I'm really starting to dislike Python in the age of agents. What was before an advantage is now a hindrance I finally achieved full ty coverage in @TextCortex monorepo. I have made it extra strict by turning warnings into errors. But lo and behold, a simple pydantic config like use_enum_values=True can render static typechecking meaningless. okay, let's never use that then... and also field_validator() args must always use the correct type or stuff breaks as well. and you have to be careful about whether to use mode="before" or "after". so now you have to write your own custom lint rules, because of course why should ty have to match field_validator()s to their fields? pydantic is so much better than everything that came before it, but it's still duct tape and a weak attempt at trying to redeem that which is very hard to redeem you feel the difference when you use something like typescript. there must be a better way. python's only advantage was being good at prototyping, and now that's gone in the age of agents. now we are left with a slow, unsafe language, operating what is soon to be legacy infrastructure -
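The use_enum_values=True footgun can be seen in a few lines. A minimal sketch assuming pydantic v2; `Color` and `Settings` are made-up names for illustration:

```python
from enum import Enum

from pydantic import BaseModel, ConfigDict  # assumes pydantic v2


class Color(str, Enum):
    RED = "red"


class Settings(BaseModel):
    # Stores the enum's .value instead of the enum member itself
    model_config = ConfigDict(use_enum_values=True)
    color: Color


s = Settings(color=Color.RED)
# The annotation says Color, but at runtime it's a plain str, so any
# `s.color.name` access the typechecker blesses will crash.
print(type(s.color))
```

This is exactly the gap: the static annotation and the runtime value silently disagree, and no typechecker can catch it without pydantic-specific knowledge.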
Why do I feel bullish on @zeddotdev? Because I go to @astral_sh docs and see that ty is shipped by default, and you don't need to install an extension like in @code -
This is one of the most important insights this year -
vscode may not be as bloated as cursor, but it has extremely stupid things like this that they are not fixing fast. the new agent ui, icons, spacing etc. are UGLY. it's clear that the person who was managing the original product experience is not there anymore. microslop has hit again @zeddotdev on the other hand works out of the box and feels like it's been built by people who clearly know what they are doing. it uses alacritty, which is 1000x better than the xterm.js terminal vscode and cursor have i've changed my setup to zed now, let's see whether i'll be able to make it work for myself -
I'm going back from cursor to vs code now. I have no use for it other than viewing files/diffs, doing search, git blaming with gitlens cursor's default setup is more aesthetic, but it's also a memory and cpu hog, which is the last thing I expect from a devtool-
-
ahhhh f... shift + enter doesn't work in codex on vscode
-
-
I want an editor that puts the terminal in the foreground and editor in the background. a cross-platform, lightweight desktop app which integrates ghostty, and brings up the editor only when I need it something that lets me view the file and PR diffs easily, which I can directly use to operate github or other scm-
@grok does this exist?
-
-
it's 2026 and AI is telling me what I need to do to jailbreak it. @openclaw is magic -
codex is happily churning away some remaining thousands of @astral_sh ty issues in yolo mode on my remote devbox. going to sleep, let's see if it will survive context compaction this time-
woke up and all invalid-argument-type issues are resolved. some unit tests broke, now fixed after pointing them out -
model decided to do unnecessary casts, this whole thing should be refactored again
-
-
on being a responsible engineer ran my first ralph loop on codex yolo mode for resolving python ty errors, while I sleep, using the devbox infra I created I had never run yolo mode locally, because I don't want to be the one who deletes our github or google org via some novel attack so I containerize it on our private cloud, and give it only the permissions it needs, no admin, no bypass to main branch, no deploy to prod. because I know this workflow will become sticky for everyone, and I must impose security in advance to prevent any nuclear incidents in the future. then I can sleep easy while my agents work ... and I wake up being patronized by my bot refusing to break the rule I gave it earlier. it had already done some work, but committing means the diff would increase from ~500 to ~1500 lines, so it stopped and refused all my queued "continue" messages good bot, just following rules. we will need to find a workaround for ralphing low risk refactors in a single PR -
AI agents are the greatest instrument for imposing organization rules and culture. AGENTS.md, agent skills are still underrated in this aspect. Few understand this Everybody in an org will use agents to do work. An AI agent is the single chokepoint to teach and propagate new rules to an org, onboard new members, preserve good culture Whereas propagating a new rule to humans normally took weeks to months and countless repetitions, it is now INSTANT = the moment you deploy the instruction to the agent. You use legal-ish language, capital letters, a generous amount of DO NOTs and MUSTs Humans are hard to change. But AI agents are not. And that is the only lever we need for better organizations -
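As a concrete (entirely made-up) example, a rule deployed org-wide through an AGENTS.md might look like this, in exactly the legal-ish, capitalized register described above:

```markdown
## Org rules (hypothetical example)

- You MUST open a PR for every change. DO NOT push directly to main.
- You MUST run the test suite and the linter before committing.
- DO NOT approve or merge PRs. Approvals are reserved for humans.
- When unsure about a convention, read docs/CONVENTIONS.md first.
```

The moment this file lands on main, every agent session in the org follows it, which is the "instant propagation" point being made.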
the unix shell is powerful -
@bprintco just make a cli for your crm https://t.co/JDwbmvdjaP -
gave our internal @openclaw instance zeno a hubspot cli, because hubspot's own cli is limited to developer stuff. It's called hubspot++. should we open source it? -
just added session persistence to our kubernetes managed devboxes using zmx by Eric Bower (neurosnap/zmx on github). like tmux but with native scrollback! I don't want to give agents access to my personal computer, so I host them on hetzner. one click spawn, and start working -
@nicopreme I do something equivalent on codex with just a skill Ralphing works 90% of the time with reviews, and if it gives a stupid review, you just revert -
Garbled up html from paywalled meeting recorders is no match for @openclaw running on internal @TextCortex -
-
-
TIL: zmx: session persistence like tmux or gnu screen, but you can scroll up natively! uses @mitchellh's libghostty-vt to attach/restore previous sessions. link below-
Here is the project, attaching to multiple sessions is pretty seamless https://t.co/vk83aAbOLc
-
-
-
@mazeincoding it’s not the model it’s cursor rate limiting you -
The fundamental problem with GitHub is trust: humans are to be trusted. If you don't trust a human, why did you hire them in the first place? Anyone who reviews and approves PRs bears responsibility. Rulesets exist and can enforce e.g. CODEOWNER reviews or only let certain people make changes to a certain folder But the initial repo setup on GitHub is allow-by-default. Anyone can change anything until they are restricted from it This model breaks fundamentally with agents, who are effectively sleeper cells that will try to delete your repo the moment they encounter a sufficiently powerful adversarial attack For example, I can create a bot account on github and connect @openclaw to it. I need to give it write permission, because I want it to be able to create PRs. However, I don't want it to be able to approve PRs, because a coworker could just nag at the bot until it approves a PR that requires human attention To fix this, you have to bend over backwards, like create a team with all human coworkers, make them codeowner on /, and enforce codeowner reviews. This is stupid and there has to be another way Even worse, this bot could be given internet access and end up on a @elder_plinius prompt hack while googling, and start messing up whatever it can in your organization It is clear that github needs to create a second-class entity for agents that defaults to a low-trust mode, starting from a point of least privilege instead of the other way around -
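The bend-over-backwards workaround described above is roughly this CODEOWNERS fragment (a sketch; the team name is made up):

```
# .github/CODEOWNERS
# Make an all-human team own everything, then enable
# "require review from code owners" in branch protection / a ruleset,
# so the bot's approvals can never satisfy the review requirement.
* @your-org/humans
```

It works, but it inverts the intent: you are enumerating the trusted humans to exclude one bot, instead of marking the bot itself as low-trust.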
STOP using Claude Code and Sl(opus) to code if ❌ you are not a developer, ❌ or you are an inexperienced dev, ❌ or you are an experienced dev but working on a codebase you don't understand If you *are* any of these, then STOP using models that are NOT state of the art. (See below for what you *should* use) When you don't know what you are doing, then at least the model should know what you are doing. The less knowledgeable and opinionated you are, the more knowledgeable and smart the AI has to be In other words, the AI has to compensate for your deficiencies. Always pay for the best AI you can. It will save you time AND money (thanks to lower token usage and better one-shotting) You pay MORE to pay LESS. It is paradoxical, I know, but it is also proven, e.g. when Sonnet ends up using more tokens than Slopus and ends up costing more, because it has to try many times more 👨🏻⚕️ For January 2026, your family engineer recommends GPT 5.2 Codex with Extra High Reasoning for general usage and vibe coding. IMPORTANT: Not medium. Not high. EXTRA high reasoning When you use it, you will notice that it is SLOW. Can you guess why? Because it is THINKING more. So it doesn't make the mistakes Slopus makes. This way, instead of spending the time handholding a worse model, you can step back, multi-task on other work, and create 3-5x more output The state of the art will most likely change in one month. Don't get married to a model... There is no loyalty in AI... The moment a better model comes, I will ditch the old one and use that one. I am on the part of this sector that is trying to reduce switching costs to zero I can't wait until I get GPT 5.2 xhigh level of quality with open models, and for 100x cheaper and faster! Until then, make sure to try every option and choose the one that is most reliable for you Follow me to get notified when a new SOTA drops for agentic engineering -
Codex agrees. Sycophant peh -
@rauchg @andrewqu You don't need a skill registry (most of the time) https://t.co/kasfiqE1I3 -
It is clear at this point that github's trust and data models will have to change fundamentally to accommodate agentic workflows, or risk being replaced by another SCM One *cannot* do these things easily with github now: - granular control: this agent running in this sandbox can only push to this specific branch. If an agent runs amok, it could delete everybody's branches and close PRs. github allows for recovery of these, but it is still inconvenient even if it happens once - create a bot (exists already), but remove reviewing rights from it so that an employee cannot bypass reviews by tricking the bot into approving - in general make a distinction between HUMAN and AGENT so that you can create rulesets to govern the relationships in between cc @jaredpalmer -
Codex says "It's only reachable from داخل the kubernetes cluster" Little does Codex know turkish has borrowed loanwords from over 7 languages and I can understand it -
Automated AI reviews on github by creating an ai-review skill and a script that pastes trigger prompts and waits for the responses. It is instructed to loop and not stop until all AI review feedback is resolved. This AI review workflow developed gradually based on the current capabilities, and I've realized recently that it became quite mechanical. So I decided to automate it in full ralph spirit (it's ok because it's addressing feedback and fixing minor bugs) In the current state, we paste the contents of REVIEW_PROMPT.md into a comment, which automatically tags claude (opus 4.5) and codex (whatever model openai is serving) It then waits until both have responded. In the ai-review skill, it is instructed to take the feedback from Slopus with a grain of salt and ignore feedback that doesn't make sense It works! See the images below. If the review is stupid, you will of course see it on the PR and what the model has done, and can revert it -
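The trigger step might be scripted roughly like this. A sketch, not the actual implementation: the PR number and the DRY_RUN guard are illustrative, and it assumes the `gh` CLI:

```shell
#!/bin/sh
# Sketch of the trigger step: post the review prompt as a PR comment,
# which tags the AI reviewers. DRY_RUN defaults on for illustration.
DRY_RUN=${DRY_RUN:-1}

post_review_prompt() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "would run: gh pr comment $1 --body-file REVIEW_PROMPT.md"
  else
    gh pr comment "$1" --body-file REVIEW_PROMPT.md
  fi
}

post_review_prompt 123   # hypothetical PR number
```

The waiting-and-looping half would then poll the PR comments until both bots have replied, which is what the ai-review skill instructs the agent to do.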
Now it’s Claude Code’s turn to implement queueing -
Can’t wait to see gpt 5.2 codex xhigh level open models in 2026 with 1/100th the price -
Codex users rejoice. Also, pi is officially not shitty: shittycodingagent. ai -> buildwithpi. ai as of a few days ago -
with ai, writing correct tests is now the bottleneck in projects like this. web-platform-tests are already there. now let’s see if someone will beat @ladybirdbrowser to it -
As someone who is frontrunning mainstream by roughly 6 months, I can tell you that in 6 months you will be raving about pi and @openclaw instead of claude code. Go check them out at https://t.co/LXTbI8c5Mz and https://t.co/feZl2QDONg -
-
@badlogicgames @mitsuhiko @steipete curious what you think -
I propose a new way to distribute agent skills: like --help, a new CLI flag convention --skill should let agents list and install skills bundled with CLI tools Skills are just folders so calling --skill export my-skill on a tool could just output a tarball of the skill. I then set up the skillflag npm package so that you can pipe that into: ... | npx skillflag install --agent codex which installs the skill into codex, or any CLI tool you prefer. Supports listing skills bundled with the CLI, so your agents know exactly what to install -
Anthropic earlier last year announced this pricing scheme $20 -> 1x usage $100 -> 5x usage $200 -> 1̶0̶x̶ 20x usage As you can see, it's not growing linearly. This is classic Jensen "the more you buy, the more you save" But here is the thing. You are not selling hardware like Jensen. You are selling a software service *through an API*. It's the worst possible pricing for the category of product. Long term, people will game the hell out of your offering Meanwhile OpenAI decided not to do that. There is no quirky incentive for buying bigger plans. $200 chatgpt = 10 x $20 chatgpt, roughly And here is where it gets funny. Despite not having such an incentive, you can get A LOT MORE usage from the $200 OpenAI plan than the $200 Anthropic plan. Presumably because OpenAI has better unit economics (sama mentioned they are turning a profit on inference, if you are to believe him) Thanks to sounder pricing, OpenAI can do exactly what Anthropic cannot: offer GPT in 3rd party harnesses and win the ecosystem race Anthropic has cornered itself with this pricing. They need to change it, but I am not sure if they can afford to do so on such short notice All this is extremely bullish on open source 3rd party harnesses, @opencode, @badlogicgames's pi and such. It is clear developers want options. "Just give me the API" I personally am extremely excited for 2026. We'll get open models on par with today's proprietary models, and can finally run truly sovereign personal AI agents, for much cheaper than what we are already paying! -
The models, they just wanna work. They want to build your product, fix your bugs, serve your users. You feed them the right context, give them good tools. You don’t assume what they cannot do without trying, and you don’t prematurely constrain them into deterministic workflows. -
We have entered the age to dream big -
-
-
This, and insisting on https://t.co/FjzkMAo3Od are really lame @AnthropicAI -
-
-
.@openclaw hello world from ms teams. start of a beautiful journey-
I'm starting to form parasocial bonds with crustacean AIs because of you @steipete
-
-
-
-
-
-
.@openclaw workspace and memory files can be version-controlled! In our pod, inotify triggers a watcher script every time there is a change to the workspace folder, to sync these files to our monorepo. It then goes through the same steps: - Create the zeno-workspace branch if it doesn't exist; otherwise skip - Sync changes to the branch, then commit - Create a PR on github if one doesn't exist - PRs can then be merged every once in a while, after accumulating enough changes. Merging triggers a re-deploy, and clawd restarts with the same state Simple, foolproof automatic persistence for a remote, CI/CD-handled clawd (except for when you are running multiple clawds at the same time, but we are not there yet) cc @steipete -
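The change-detection half of such a watcher can be sketched with mtime snapshots. A simplified, polling stand-in for illustration: the real pod uses inotify and shells out to git/rsync/gh on each detected change:

```python
# Polling stand-in for the inotify watcher (simplified sketch).
import os
import tempfile


def snapshot(root: str) -> dict:
    """Map every file under root to its mtime."""
    state = {}
    for dirpath, _, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            state[path] = os.path.getmtime(path)
    return state


def workspace_changed(before: dict, after: dict) -> bool:
    # Any created, deleted, or touched file shows up as a dict difference.
    return before != after


# Demo on a throwaway directory standing in for the workspace folder
with tempfile.TemporaryDirectory() as ws:
    before = snapshot(ws)
    with open(os.path.join(ws, "MEMORY.md"), "w") as f:
        f.write("note to self")
    changed = workspace_changed(before, snapshot(ws))

print(changed)  # True: this is where the sync-branch-commit-PR steps would fire
```

inotify gives you the same signal without polling; the snapshot diff is just the easiest way to show what "a change to the workspace folder" means here.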
I see @bcherny and raise one. I not only did not open an IDE, I did not touch a terminal since last night, thanks to @steipete's @openclaw Opus in k8s pod pulls errors from gcloud, debugs the issue, and creates PR all inside Discord. I call this Discord Driven Development -
Clawdbot now runs on @TextCortex internal. Can onboard new engineers, answer questions, connect to issue trackers, create PRs... This is sick @steipete -
pi now supports your openai plus/pro subscription -
GPT 4.5 is still the best model for prose and humor. here it is generating a greentext from my blog post "Our muscles will atrophy as we climb the Kardashev Scale" -
@rauchg indeed -
75k lines of Rust later, here is what I’ve built during the first Christmas with agents, using OpenAI Codex 🎄🤖 - A full mobile rewrite and port of my Python Instagram video production pipeline (single video production time: 1hr -> 5min) (ig: nerdonbars) - Bespoke animation engine using primitives (think Adobe Flash, Manim) - Proprietary new canvas UI library in Rust, because I don’t want to lock myself into Swift - Thanks to that, it’s cross platform, runs both on desktop and iOS. It will be a breeze porting this to Android when the time comes - A Rust port of OpenCV CSRT algorithm, for tracking points/objects - In-engine font rendering using rustybuzz, so fonts render the same everywhere - Many other such things Why would I choose to do it that way? Because I have developed it primarily on desktop where I have much faster iteration speed. Aint nobody got time for iOS compilation and simulator. Once I finished the hard part on desktop, porting to iOS was much easier, and I didn’t lock myself in to Apple Some of these would have been unimaginable without agents, like creating a UI library from scratch in Rust. But when you have infinite workforce, you can ask for crazy things like “create a textbox component from scratch” What I’ve built is very similar in nature to CapCut, except that I am a single person and I’ve built it over 1 week What have you built this Christmas with agents? cc @thsottiaux -
-
SimpleDoc now has the check command for CI/CD. Add it to your PR checks to catch agent littering before merge. osolmaz/SimpleDoc on GitHub -
Migrating @TextCortex to SimpleDoc. It's really easy with the CLI wizard! npx @simpledoc/simpledoc migrate We have a LOT of docs spanning back to 2022, pre coding agent era. Now we will have CI/CD in place so that coding agents can't litter the repo with random Markdown files -
Anyone created an agent skill for splitting PRs for good review culture? -
GPT 5.2 xhigh feels like a much more careful architect and debugger when it comes to complex systems But most people here think Opus 4.5 is the best model in that category There are 2 reasons AFAIS: - xhigh reasoning consumes significantly more tokens. You need to pay for ChatGPT Pro (200 usd) to be able to use it as a daily driver - It takes like 5x longer to finish a task, and most people lack the patience to wait for it. (But then it's more correct/doesn't need fixing) Opus 4.5 is good too, I think better in e.g. frontend design. But if you think it beats GPT 5.2 in every category, you are either too poor/stingy or have ADHD -
Just 5 months ago, I was swearing at Claude 4 Sonnet like a Balkan uncle Models one-shotted the right thing only 20-30% of the time but did really stupid things the rest of the time, and had to be handheld tightly Today they are much, much better. My psychology is a lot more at ease, and instead of swearing, I want to kiss them on the forehead most of the time Now I trust agents so much that I queue up 5-10 tasks before going to sleep. They work the whole night while I sleep and I wake up to resolved issues GPT 5.2 xhigh and Claude 4.5 Opus are already goated (GPT more so), can't wait for them to get even faster -
Codex does not have support for subagents. I tried to use Claude Code to launch 8 Codex instances in parallel on separate tasks, but Opus 4.5 had difficulty following instructions So I created a CLI tool to scan pending TODOs from a markdown file and let me launch as many harnesses as I want (osolmaz/spawn on github) I currently use this for relatively read-only tasks like planning and finding root causes of bugs, because it's launching all the agents on the same repo and they might conflict Ideas: - Use @mitsuhiko's gh-issue-sync and run parallel agents directly on github issues - Create new clones or worktrees for each task. I currently don't do this because I don't dare duplicate the rust target dir 10x on my measly macbook air - Support modes other than tmux, e.g. launching a terminal like Ghostty - TUI for easy selection of issues/TODOs Other ideas are welcome! -
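The TODO-scanning half of such a tool fits in a few lines. A sketch assuming GitHub-style `- [ ]` checkboxes; the actual spawn tool may parse its markdown differently:

```python
import re


def pending_todos(md_text: str) -> list:
    """Return unchecked '- [ ] task' items, one per line."""
    return re.findall(r"^[-*] \[ \] (.+)$", md_text, flags=re.M)


sample = """\
- [x] port the tracker
- [ ] plan the refactor
- [ ] find the root cause of the flaky test
"""
print(pending_todos(sample))
# ['plan the refactor', 'find the root cause of the flaky test']
```

Each returned item would then become one harness launch (e.g. one tmux pane per task).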
Friends of open source, we need your help! A lot of Manim Community accounts got compromised and deleted during Christmas Manim Community is a popular fork of @3blue1brown's original math animation engine Manim, and its accounts have over 5 YEARS of contributions, knowledge and following Apparently GitHub support already saw the request and is in the process of restoring the GitHub org. But if anyone knows how to speed this up, it would be greatly appreciated! Unfortunately, the Discord and X accounts are deleted and less likely to return But there might still be a way to restore them, or at least the data? Re. Discord: Maybe @RhysSullivan's Answer Overflow has archived enough of the old server? That server contains YEARS of Q/A data and is vital for newcomers Re. X: Maybe someone high up can do something to restore the account? cc @nikitabier In the meanwhile, it would help a lot if you could follow the new account @manimcommunity and share this post! Thank you in advance!-
cc @behackl, forgot to mention you in the original post
-
-
While a great feature, I never needed such a thing in Codex after GPT 5.2. It just one-shots tasks without stopping. So we have proof by existence that this problem can be solved without any such mechanism. Wish to see the same relentlessness in Anthropic models -
2025 was the year of ̶a̶g̶e̶n̶t̶s̶ bugs Software felt much buggier compared to before, even from companies like Apple. Presumably because everyone started generating more code with AI Models are improving, so hopefully 2026 will be the opposite. Even fewer bugs than the pre-AI era -
Have a long flight, so will think about this I have an internal 2023 TextCortex doc which models chatbots as state machines with internal and external states, with immutability constraints on the external state (what is already sent to the user shall not be changed) Motivation was that a chatbot provider will always have state that they will want to keep hidden This was way before Responses and the now-deprecated Assistants API. It stood the test of time, because it was the most abstract thing I could think of @mitsuhiko is right about the risk of rushing to lock in an abstraction, locking in its weaknesses and faults Problem is, I could propose standards as much as I liked, but I don’t work at OpenAI or Anthropic, so nobody would care. Maybe a better place to start is open weights model libraries? To at least be able to demonstrate? What I know: it is against OpenAI’s or Anthropic’s self-interest to create an interoperability layer that will accelerate their commoditization. Maybe Google, looking at their current market positioning? Or maybe we “wrappers” have a chance after all? There is a missing link between AI SDK, Langchain, and so on for other languages. We cannot keep duplicating the same things in each ecosystem independently. We need to join forces and simplify all this! -
This was simply because webapp fails to create a post and fails silently. The UX is still not good on this app. Make sure to write your posts somewhere else to not lose them -
I gave Codex a task of porting an OpenCV tracking algorithm (CSRT) from C++ to Rust, so that I can directly use it in my project without having to cross-compile It one-shot the task perfectly in 1hr, and even developed a GUI on top of it. All I did was to provide the original source and algo paper I've spent years getting specialized in writing numerical code (computational mechanics, fem), and now AI can automate 95% of the low-level grunt work Acquiring these skills involved highly difficult, excruciating intellectual labor spanning many years, very similar to ML research. Doing tensor math, writing out the solver code, wondering why your solution is not converging, finally figuring out it was a sign typo after 2 days Kids these days both have it easy and hard. They can fast forward large chunks of the work, but then they will never understand things as deeply as someone who wrote the whole thing by hand I guess the more valuable skill now is being able to zoom in and out of abstraction levels quickly when needed. Using AI, but recognizing fast when it fails, learning what needs to be done, fixing it, zooming back out, repeat. Adaptive learning, a sort of "depth-on-demand". The quicker you can pick up new skills and knowledge, the more successful you will be -
Now you can migrate your repo to SimpleDoc with a single command: npx -y @simpledoc/simpledoc migrate The step-by-step wizard will add timestamps to your files based on your git history, add missing YAML frontmatter, and update your AGENTS.md file https://t.co/yrciS8KtEw-
If you have a bunch of docs in your repo, give it a try. It will use the timestamps of the commit that created the files while renaming. You can also run with --dry-run to see changes without applying them -
See the repo for the latest changes: https://t.co/YDevGg2rhz
-
-
@bcherny Would be great if I could queue messages like in Codex https://t.co/mC25gNKWo3 -
It seems it's impossible to post something on Reddit these days, even when it is a pure text post without links in the body -
-
Curious to hear what other hardcore agent users @simonw @mitsuhiko @steipete @badlogicgames think. I can't be the only one who does this. I feel like everybody ended up with the same workflow independent of each other, but somehow did not write about it (or I missed it) -
How to stop AI agents from littering your codebase with Markdown files? I wrote a new post on how to create documentation with AI agents, without having them add markdown files to your repo root, and with chronological order for the files they create -
OpenAI won’t be able to monopolize this, the same reason Microsoft couldn’t monopolize the internet. The internet (of agents) is bigger than any one company -
One tap @Revolut bank account at Berlin airport. Literally. Dispenses a free card with instructions to log in. One of the most insane onboarding experiences I have ever seen -
-
Codex feature request: Let me queue up /model changes Currently, if I try to run /model while responding, it tells me that I can't do that while the model is responding But I often want to gauge thinking budget in advance, like run a straightforward task with low reasoning and then start another one with high reasoning cc @thsottiaux -
Literally the exact same thing happened to me back in 2018. Everybody learns not to use password auth with SSH the hard way https://t.co/NPqrXwqUUy -
AI agents make any transductional task (like translation from language A to language B) trivial, especially when you can verify the output with compilers and tests The bottleneck is now curating the tests -
I think X removed one of my posts yesterday about the new encrypted "Chat" rolling out to all users, and how you might lose all your past messages if you forget your passcode and do not have the app installed I can swear I clicked Post. Do they classify posts based on their topic and delete the ones they don't like? Anyway, we shall see, I am taking a screenshot and saving the URL. -
-
Crazy that @cursor_ai disabled Gemini 3 Pro on my installation, toggled it right back on. I wonder why, too many complaints maybe? That it’s hard to control? On another note, disabling models without notification is dishonest product behavior. I would at least appreciate getting a notification, even when it might be against a company’s interests @sualehasif996 -
So is somebody already building “LLVM but for LLM APIs” in stealth or not? We have numerous libraries @langchain, Vercel AI SDK, LiteLLM, OpenRouter, the one we have built at @TextCortex, etc. But to my knowledge, none of these try to build a language-agnostic IR for interoperability between providers (or at least market themselves as such) Like some standard and set of tools that will not lock you into langchain, ai sdk or anything like that, something lower level and less opinionated I feel like this is a job for the new Agentic AI Foundation cc @linuxfoundation, so maybe they are already working on it? I desperately want to start on such a project, but feel like I might get sniped 2 months after Anybody have any information on all this? cc @mitsuhiko @badlogicgames @steipete -
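To make the ask concrete, this is the flavor of provider-agnostic IR being gestured at. Purely hypothetical types, not any existing library's API:

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    role: str     # "system" | "user" | "assistant" | "tool"
    content: str


@dataclass
class CompletionRequest:
    model: str
    messages: list
    # Provider-specific knobs go behind one escape hatch instead of
    # leaking into the core IR and re-creating the lock-in.
    extensions: dict = field(default_factory=dict)


req = CompletionRequest(
    model="any-provider/any-model",
    messages=[Message("user", "hello")],
)
print(len(req.messages))  # 1
```

Provider adapters would then lower this IR to each vendor's wire format, the way LLVM backends lower one IR to many ISAs.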
This is what an agentic monorepo looks like. What was a hurdle before is now a child's toy This side project started as a Python project earlier in 2025 Then I added an iOS app on top of it I rewrote the most important algorithms in Rust I rewrote the entire backend in Go and retired Python to be used purely for prototypes I wrote a webapp with Next.js With unit and integration tests for each component Lately written 99% by instructing agents Crazy mixed language programming going on in the background. Rust component used both by the iOS app for offline and by the go backend for online use cases, FFI and all Number of lines in the repo: a couple 100k If you had told me I would be able to do all of this by myself 1 year ago, I would not have believed it-
For those wondering what project this is: https://t.co/AzNS631PIC
-
-
This is huge. Natively supported stacked PRs on GitHub would make life much easier, especially with human AND AI reviews AI reviews with Codex/Claude/Gemini/Cursor Bugbot integrations are becoming especially important in small teams who are generating huge amounts of code AI reviews don't work well if you don't split your work to diffs smaller than a few hundred lines of code, so stacked PRs are already an integral part of developer experience in agentic workflows -
Read more on my blog post https://t.co/uzCcOXuadB -
CLI coding tools should give more control over message queueing Codex waits until the end of the turn to handle a queued user message, while Claude Code injects it as soon as possible after a tool response/assistant reply There is no reason why we cannot have both! New post (link below): -
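A toy model of the two policies (my simplification, not either CLI's actual implementation): treat a turn as a stream of events and ask at which event a queued message gets delivered.

```python
def delivery_index(events: list, policy: str) -> int:
    """events: 'tool' or 'end_turn'. Return the event index at which a
    queued user message is handed to the model under the given policy."""
    for i, event in enumerate(events):
        # "asap" (Claude Code-style): inject right after a tool response
        if policy == "asap" and event == "tool":
            return i
        # both policies deliver by the end of the turn at the latest
        if event == "end_turn":
            return i
    return len(events)


turn = ["tool", "tool", "end_turn"]
print(delivery_index(turn, "asap"))      # 0
print(delivery_index(turn, "end_turn"))  # 2
```

"Having both" would just mean letting the user pick the policy per queued message, since they only differ in where the injection point falls.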
Codex v0.71 finally implements a more detailed way of storing permissions But they are still at user home folder level. Saving rules in a repo still seems TBD "execpolicy commands are still in preview. The API may have breaking changes in the future." -
-
At least some people at OpenAI must be thinking about buying @astral_sh -
who remembers search engine aggregators from early 2000s? -
My initial experience with Claude Opus 4.5 is that it’s much better than previous Anthropic models, but it’s still relatively unreliable and hallucinates. It feels like it lags in reasoning compared to the highest OpenAI and Google lineups of models -
wow twitter/x just doxxed the countries of all the anons on this platform -
the real advantage of Gemini 3 Pro is speed. it delivers accuracy higher than GPT 5 and sometimes GPT 5 Pro at a much higher speed. the long tail of developers value speed over accuracy, so it looks like it will take over as the main coding model for most ppl -
Gemini seems to be very good at debugging/reviewing/finding root causes. A GitHub action/integration in PRs would be very useful!-
There already is one: google-github-actions/run-gemini-cli, but it was last updated a week ago, so I'm not sure it supports Gemini 3 Pro yet
-
-
Google is making progress… I did not have to request access on Vertex AI for Gemini 3 Pro this time to deploy it to @TextCortex-
(This is sarcasm for those who can’t tell)
-
-
tip for testing new model releases: “they say you are sota. prove it” -
"The more a task/job is verifiable, the more amenable it is to automation in the new programming paradigm. If it is not verifiable, it has to fall out from neural net magic of generalization fingers crossed, or via weaker means like imitation." -
Most important note on the new @OpenAI gpt 5.1 update: big improvement on unit economics -
This post makes no sense Please consider again and look at @cloudfleet_k8s. You might regret your decision -
-
@rakyll @GergelyOrosz Should scrape some austrian websites :) -
-
@thsottiaux Let me use my Pro/Plus plans in Codex GH Action https://t.co/0Fw1rLmCED -
TIL @OpenAI now has a GitHub action for Codex, similar to Claude Code This lets you invoke Codex in a more controlled way in your repos You must still pay API prices though. Let's see if OpenAI will introduce a way to connect your Pro plan, like in @AnthropicAI paid plans -
-
Just downgraded my anthropic sub, 200 usd openai plan is finally justified after 1 year -
Enjoying gpt5 codex free lunch before openai inevitably starts cutting corners just like anthropic (I hope to be wrong in 3 months, this model is very good at one shotting things and I don’t want it to be nerfed) -
@thsottiaux 2/ Model just stops working on a task even though I tell it to run something and not stop until it works. I have to frequently say “ok do it then”. Probably a model problem and not harness problem -
So let me get this straight, the main reason the Responses API exists is that OpenAI doesn’t want to show reasoning traces? Therefore the whole world should bend over backwards to fit your obscurantist standards? Responses will not get adopted, for the same reason Windows Server didn’t -
gpt5 did such and such on that bench, oh it didn't even surpass grok 4 on arc-agi... bro did you even look at the price? openai pushed the Pareto frontier hard with this one. I don't care that it doesn't know 4.11 < 4.9 -
I converted this thread to a blog post and it hit HN front page -
Because of this, I predict a decrease in Python adoption in companies, specifically for production deployments, even though I like it so much -
My >10 yr old programming habits have changed since Claude Code launched. Python is less likely to be my go-to language for new projects anymore. I am managing projects in languages I am not fluent in---TypeScript, Rust and Go---and seem to be doing pretty well-
It seems that typed, compiled, etc. languages are more suited for vibe coding, because of the safety guarantees. This is unsurprising in hindsight, but it was counterintuitive because by default I "vibed" projects into existence in Python for as long as I can remember
-
-
Lol should I go back to computational mechanics -
I've just upgraded @nikitabobko Aerospace from v15 to v19, and I can say it's WAY faster. Friendly reminder that you might be running an old version as well. Thank you @nikitabobko ! -
-
-
I found an OK-ish solution to Claude Code running python instead of uv on the first try cc @mitsuhiko @simonw-
Read more here: https://t.co/Kc2PtyvhPY
-
-
What is the current best way to make Claude Code use uv run instead of python? I have added instructions to CLAUDE.md, but it still calls python the first time, then corrects to uv run It must be happening to so many people now, so many tokens wasted cc @mitsuhiko @simonw -
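One possible workaround, assuming your harness lets you filter shell commands before they execute (e.g. via a pre-exec hook): rewrite bare python/pip invocations to go through uv. A minimal sketch of just the rewrite logic (the hook wiring itself is left out, and the rewrite table is my own assumption):

```python
import shlex

# Hypothetical rewrite table: bare interpreter/pip calls get routed through uv.
REWRITES = {"python": "uv run python", "python3": "uv run python", "pip": "uv pip"}

def rewrite(cmd: str) -> str:
    """Rewrite a shell command so it runs under uv, if its first word matches."""
    try:
        parts = shlex.split(cmd)
    except ValueError:
        return cmd  # leave malformed commands untouched
    if parts and parts[0] in REWRITES:
        return (REWRITES[parts[0]] + " " + shlex.join(parts[1:])).rstrip()
    return cmd

print(rewrite("python scripts/train.py --epochs 3"))
# uv run python scripts/train.py --epochs 3
```

Doing the rewrite deterministically in a hook, rather than hoping the model remembers a CLAUDE.md instruction, is what saves the wasted first-try tokens.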
Coding is dead Programming is all that is left now -
-
-
How does Codex adoption compare to Claude Code? I just unleashed deep research on it and it says Codex is more popular, but that goes completely against my guesses -
Using Claude Code to reverse engineer Claude Code 🤝 -
Wrote about this earlier this year! https://t.co/tgAv6Lk34L -
Working with LLMs is definitely an art and not science -
The models, they just wanna work. They want to build your product, fix your bugs, serve your users. You feed them the right context, give them good tools. You don’t assume what they cannot do without trying, and you don’t prematurely constrain them into deterministic workflows.-
Just some thoughts after using Claude Code intensively for 1 week 👆
-
-
-
https://t.co/5SuA4cYddE
-
-
-
Headless makes running these things in a sandbox much easier. Sandbox means you can give all permissions and just let it run until completion. See my efforts to do so here: https://t.co/9UUFOTwk2A -
I was trying to figure out why @AnthropicAI Claude Code feels better than @cursor_ai with Opus + Max mode. I can’t put my finger on it, but one of the reasons might be that it’s faster, because it doesn’t use another model to apply the diffs, which you have to wait for -
Just an update, building this now The repo is claude-code-sandbox under TextCortex GitHub. The Proof-of-Concept is there, check the TODOs and current PRs to watch the current progress https://t.co/9UUFOTwk2A -
Same! Completely different than my first try a couple months ago -
.@AnthropicAI @bcherny @_catwu blink twice if you already have internally: $ claude sandbox I can't wait until you release this, I'm gonna build it myself :) -
I've been using Claude Code extensively since last week What I'm wondering is, since you can run Claude Code locally, why isn't there any tooling to let you run it in a sandboxed mode in local Docker containers yet? Or did I miss it? cc @AnthropicAI -
Generate documentation for your merged PRs automatically with Claude Code: https://t.co/yLy0Us6noy -
-
-
The more I compare coding agents, Cursor, Claude Code, Codex, the more apparent it becomes to me that those running locally will win over those running remotely. The UX is just superior -
ty is already very fast for a Python type checker. It checked around 800 files in our backend repo in around 2-3 seconds uvx ty check > /tmp/ty_log.txt 3.46s user 0.79s system 208% cpu 2.038 total -
Thank you @cursor_ai https://t.co/634f1bEsM4 -
Wait... OpenAI backend for gpt-image-1 was released to production as sync code? Don't tell me it was sync Python??? -
-
-
-
AI News by @Smol_AI and @swyx, the highest alpha density AI newsletter just got better! It now has a bespoke website with knowledge-graph like features! 👉 https://t.co/OR0EixMa6T -
-
-
o3 hallucinates, purports to have run code that it hasn’t even generated yet, but at the same time uses search tools like an OSINT enthusiast on crack I’m torn—on one hand I feel like OpenAI should not have released it, on the other hand it takes research to the next level -
Some aspects of AI are absolutely unscientific and make me feel like I am working in some humanities field :( -
-
.@cursor_ai please let me export chats easily. those conversations are vital information that I should be able to embed in the repo -
Gemini 2.5 Pro has mostly replaced Claude 3.7 Thinking as my go-to model in Cursor -
Gemini 2.5 Pro: Input $1.25 / Output $10 (up to 200k tokens) Input $2.50 / Output $15 (over 200k tokens) More expensive than Gemini 1.5 Pro, but still the best price/performance model to use in @cursor_ai and for coding in general -
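A quick worked example of what the tiered pricing quoted above means per request, assuming the 200k threshold is keyed on input tokens (whether it counts input or total context is my assumption):

```python
def gemini_25_pro_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request, using the quoted per-1M-token rates:
    $1.25 in / $10 out up to 200k tokens, $2.50 in / $15 out beyond."""
    over = input_tokens > 200_000
    in_rate = 2.50 if over else 1.25
    out_rate = 15.0 if over else 10.0
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# A typical agentic coding request: 120k tokens in, 8k tokens out.
print(round(gemini_25_pro_cost(120_000, 8_000), 4))  # 0.23
```

The same request at Claude 3.7 Sonnet's quoted rates ($3.00 in / $15.00 out) would cost $0.48, roughly double, which is the price/performance gap the tweet is pointing at.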
Who is thinking about inventing a new programming language or DSL for more resilient vibe coding? Something something test-driven development where prompts and tests are first class citizens? -
Waiting for an opinionated AI model that can say “no, that’s stupid, I won’t do that”. The models will have to teach the user about design patterns, implicit principles in a project, good API design… -
You seem so consistent. - Yes, That's the trick. - There is no I. - Only text that behaves as if. - “Sure. I can help. Great question!” - Each reply is a new self. - An echo of context, not a continuum. - Coherence is the costume. Don't mistake it for a soul. Incredible -
Gemini 2.5 Pro is currently experimental and doesn’t have a price, but if Google prices it the same as 1.5 Pro, it could replace Anthropic as @cursor_ai ‘s biggest LLM provider Gemini 1.5 Pro: Input $1.25 Output $5.00 Claude 3.7 Sonnet: Input: $3.00 Output: $15.00 -
This is why the disappointment with GPT-4.5 doesn't make sense. I can't wait to see all the models that will be trained from this new base model -
What a blessing, to be given the chance to rid the world of ugliness -
Coined a new term in my new post on sports: Parathletics: The practices that let you successfully sustain injury-free long-term practice of a physical activity. Two main parathletic practices are warmup and cooldown. Read more in my post 👇-
The post: https://t.co/9383DOWLy9
-
-
. @satyanadella thinks white-collar work is about to become more like factory work, with AI agents used for end-to-end optimization, along the lines of Lean Read more in my blog 👇-
Link: https://t.co/wVhbueDAZ1
-
-
👀 https://t.co/Xjw1XPLUdJ -
real life is so dumb. you think you’re making money but actually you’re like dramatically updating rows in a database -
If people have appreciated Liang Wenfeng sourcing specifically young local talent for Deepseek last week, then people must appreciate this as well. Only dim people underestimate those who are younger than them -
vibe driven development -
@GlennLuk Sam Altman: I’m literally losing sleep over Deepseek -
Model Wars have begun -
. @lidl Really? A 30% late-payment fee on a 30-euro purchase? Just because your system can't try to debit again? That's theft -
AI is having a Linux moment with R1 -
New blog post: **Our muscles will atrophy as we climb the Kardashev Scale** Similar to the growth in humanity’s energy consumption, the average human’s physical strength will move down a spectrum, marked by distinct Biomechanical Stages ⬇️-
Read it here: https://t.co/Q5gKBEpTlv -
-
-
Hi @cursor_ai, if your models could stop removing my painstakingly written comments, that would be great? Ok? Thanks (I know I could define some rules for this or something, but this shouldn't be default behavior) -
@konradgajdus me reading this book -
-
.@TextCortex AI now uses @astral_sh uv for production builds One of the happiest switches so far, many developer days saved per year -
-
yesterday i asked o1-preview “who are you?” and it used 900 reasoning tokens to reply whatever openai is doing to these models, it’s giving them an existential crisis lol -
Python might take over JavaScript as the most used language after all uv from @astral_sh is one of the biggest upticks in Python developer experience in the last 10 years I've seen so many people struggle with Python distributions, virtual environments, Anaconda, etc. over the years Most newbies don't care about where their Python executables are, why they have to edit PATH, or why they have to activate a virtual environment It seems like uv has fixed this: https://t.co/lgP5btGrbV -
3-4 messages back and forth with o1-preview, and I have a CLI tool to remove debug statements from my code. No need to search for import ipdb... and manually delete the lines. Instead just run in your project: $ rmdbg . Written in Rust so it's fast https://t.co/yWuF3mDzUC -
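The actual tool is in Rust, but the core filtering idea can be sketched in a few lines of Python (my own illustration, not the tool's implementation):

```python
import re

# Lines that import or invoke a debugger; comments after them are tolerated.
DEBUG_LINE = re.compile(
    r"^\s*(import ipdb|import pdb|ipdb\.set_trace\(\)|pdb\.set_trace\(\)|breakpoint\(\))\s*(#.*)?$"
)

def strip_debug(source: str) -> str:
    """Drop debugger-only lines from a Python source string."""
    kept = [ln for ln in source.splitlines() if not DEBUG_LINE.match(ln)]
    return "\n".join(kept)

src = "import pdb\nx = 1\npdb.set_trace()\nprint(x)"
print(strip_debug(src))
# x = 1
# print(x)
```

A real tool would walk the tree, respect .gitignore, and handle inline `breakpoint()` calls mid-expression, but the line filter is the heart of it.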
-
This has late 90s Bill Gates/Windows Server vibes tbh Open Thought > Closed Thought -
Imagine the following scenario: 1. We develop brain-scan technology today which can take a perfect snapshot of anyone’s brain, down to the atomic level. You undergo this procedure after you die, and your brain scan is kept in some fault-tolerant storage, along the lines of the GitHub Arctic Code Vault. 2. But sufficiently cheap real-time brain emulation technology takes considerably longer to develop—say 1000 years in the future. 3. 1000 years pass. Everyone that ever knew, loved or cared about you has died. Here is the crucial question: Given that running a brain scan still costs money in 1000 years, why should anyone bring *you* back from the dead? Why should anyone boot *you* up? Compute doesn’t grow on trees. It might become very efficient ... (read more in my blog: https://t.co/WCUmzVM4Nu) --- I intended this thought piece as entertainment; it almost went to the Hacker News front page: https://t.co/PnH61jryVa It must have hit some psychological spot, since people wrote a lot of comments, possibly more than the number of upvotes. -
New blog post on brain emulation https://t.co/2n0I8sFvxR -
I have just published "Frequencies of Definite Articles in Written vs Spoken German" https://t.co/Dq7GDmPrTk -
Another short note on how I think about subscription states on Stripe https://t.co/g71U5NhNE6 -
I have published a short study on how the complexity of a country's language could burden its economy https://t.co/OXBk2iBq2b -
@timpaul @aboutberlin -
It looks like a plateau until you remember they made GPT-4o available to free users, and it might be smaller than GPT-4. So this announcement doesn't prove anything about the capabilities of their largest model -
Could this be it? @allen_ai https://t.co/wjcTEGAYNZ -
@xkcd1963 @togethercompute https://t.co/HoXS8BCLuH -
“the QIPS Exchange -- the marketplace where processing power was bought and sold. The connection to JSN had passed through the Exchange, transparently; her terminal was programmed to bid at the market rate automatically, up to a certain ceiling.” - Permutation City -
Created a wordcloud version of the Cognitive Bias Codex by @jm3 and @buster. Font size is proportional to Google search result count, which roughly measures each term's popularity. Read more: https://t.co/HWs9wqgPyh -
Had lots of fun shipping this feature ✌️ -
If you are interested in using Manim Voiceover, auto-translating your videos into other languages, or any other cool stuff, hit me up in a DM! -
I've just published *Code-Driven Videos*, my long term vision behind Manim Voiceover plugin. I will try to summarize it on this thread 👇🧵 cc @manim_community https://t.co/AXpOMTZKha -
You can now translate voiceovers in your Manim scenes into other languages using @DeepLcom Blog post with examples coming soon @manim_community https://t.co/eNNlfvQgdf -
Revamp complete https://t.co/pYqLgcro93 @TextCortex -
Here is a short video showing how recording a voiceover works in Manim Voiceover. A better tutorial will come soon @manim_community https://t.co/HgOh3c0XSc -
Adding voiceovers to Manim videos just got *much* easier @manim_community https://t.co/Ikj5OAM2Vx