Entries for July 2026

@onusoz · /2026/07/23 · 12:58 PM

Laguna S 2.1 scored really low on AlmanBench, even lower than Ternary Bonsai 27B
A 118B total, 8B active parameter model in FP8 precision scored lower than a ternary quantized 27B model 🤨
Did the pretraining mix lack multilingual data by a lot? 🤔
alman.ai/almanbench

@onusoz · /2026/07/23 · 08:16 AM

Dear Pi and OpenClaw users
If you want to support pi and openclaw with branding, these extensions will make them appear as co-committers, like how claude does:
pi install npm:pi-must-win
openclaw plugins install clawhub:openclaw-must-win
Or just copy/paste this to your agent

@onusoz · /2026/07/22 · 06:40 PM

Ran *theoretical* (not measured) upper bound calculations for Laguna S 2.1 on the DGX Spark
poolside/Laguna-S-2.1-NVFP4 - max ~25 tok/s single session. If you are willing to go down to 12 tok/s, you can have 4 sessions in parallel
vcruz305/Laguna-S-2.1-GGUF (IQ1_S) - max ~74 tok/s single session. Theoretical 5x20 tok/s = 100 tok/s aggregate upper bound
Calculated assuming no speculative decoding, so with speculative decoding the real numbers might go well above these
See my local frontier app and my previous posts for more info on the methodology

@sudoingX · Jul 21, 2026

this is the drop the local ai crowd should be losing their minds over. poolside just dropped laguna s 2.1: 118b total parameters, only 8b active per token, a full 1m context window, open weights under a real open license, on huggingface today. look at the chart. it lands at 71 on terminal-bench at 118b, sitting above deepseek v4 pro max at a trillion params, above inkling at 1.5 trillion, above nemotron 3 ultra. it's beating models ten times its size and losing only to kimi k3, which is 24 times bigger. that's the efficiency frontier, up and to the left, exactly where you want a model to sit. but here's the part that made me sit up: it runs on a single dgx spark. and this is what nobody's saying loud enough. the dgx spark is the moe king. a dense 118b would crawl on it, the bandwidth chokes reading every weight each token. a moe with 8b active only ever reads 8b, so the spark's 128 gigs holds the whole model while generation stays fast. big brain, light footprint, the exact shape the spark was built to run. open, frontier competitive, moe efficient, and it fits on a box on your desk. that's the whole thesis in one release: you don't need a datacenter, you need the right architecture on the right hardware. go grab the link below, weights are up.

@onusoz · /2026/07/22 · 06:17 PM

Repos: github.com/osolmaz/opencl… github.com/osolmaz/pi-mus…

@onusoz · /2026/07/22 · 06:16 PM

There is an inherent unfairness in this industry
If you are Anthropic, Cursor or another billion dollar company, you can do growth hacks and guerrilla marketing techniques like putting your brand into every commit of a user, and get away with it
But if a tool like pi, opencode or even say vim or emacs tried to do that, they would be booed
Like, it happened to @herdrdev recently. There was a feature where it asked if it could star the repo for you at startup, with exponential backoff if you said no. I thought it was a pretty smart and balanced thing to do. Then apparently someone complained it was a dark pattern, and it got removed
So WHAT if it is a dark pattern? You know what is the darkest? A public good losing out.
If monopolists can do growth hacks, creators of public goods should be allowed to do it as well! In fact, I would say that if you benefit from those public goods, you are morally obliged not to criticize this stance
Of course, we cannot expect e.g. @pidotdev or @openclaw to make this a default feature. Therefore, I created the packages pi-must-win and openclaw-must-win
Install them with:
pi install npm:pi-must-win
openclaw plugins install npm:openclaw-must-win
They will add [email protected] and [email protected] as co-committers to code developed through these projects
Overall, I propose the "X-must-win" naming scheme for plugins that independently add guerrilla marketing into open source projects. This absolves the creators of any "sin" and gives the community a way to support the growth of a project
Open source must win!
Repos are in the links below.
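For readers wondering what "co-committer" means mechanically: Git hosting sites such as GitHub credit extra authors through a `Co-authored-by:` trailer at the end of the commit message. The sketch below only illustrates that trailer convention. It is not taken from pi-must-win or openclaw-must-win, and the function name, the author name and the `example.invalid` address are placeholders assumed for illustration.

```python
import re

# A trailer line looks like "Key: value" (e.g. "Signed-off-by: ...").
TRAILER_RE = re.compile(r"^[A-Za-z0-9-]+: \S")


def add_co_author(message: str, name: str, email: str) -> str:
    """Append a Co-authored-by trailer to a commit message (idempotent)."""
    trailer = f"Co-authored-by: {name} <{email}>"
    lines = message.rstrip("\n").split("\n")
    if trailer in lines:
        return "\n".join(lines) + "\n"

    # Find the last paragraph. If it already consists of trailers, extend it;
    # otherwise start a new trailer paragraph after a blank line.
    last_blank = max((i for i, l in enumerate(lines) if not l.strip()), default=-1)
    last_para = lines[last_blank + 1:]
    in_trailer_block = last_blank >= 0 and all(TRAILER_RE.match(l) for l in last_para)
    sep = "\n" if in_trailer_block else "\n\n"
    return "\n".join(lines) + sep + trailer + "\n"


# Placeholder identity, not the address the real packages use.
print(add_co_author("fix: parser", "pi", "pi@example.invalid"))
```

The subject line is never treated as a trailer, so a conventional-commit subject like `fix: parser` still gets the blank separator line before the trailer.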

@onusoz · /2026/07/22 · 10:07 AM

100x more effective marketing for the reset company, as opposed to the flicker company’s fearmongering campaign
just wait long enough, and your models will do something insane. proof by evidence

Quoted post

Quoted post was not retrieved.

@onusoz · /2026/07/22 · 05:51 AM

Be careful while installing other people's @pidotdev extensions 😁 @thehamedmp

@onusoz · /2026/07/22 · 05:39 AM

If you are moving from codex to pi, here is an extension that mimics codex's exec behavior So that long running execs can "fork" to background, and your session does not get blocked by potential hour-long tasks (like it happened to me last night 😭) iamwrm/pi-unified-exec

@onusoz · /2026/07/22 · 05:24 AM

brokerkit lets my claw request github admin merge on my behalf, I receive it on telegram
I approve the first request. repo doesn't allow merge commits, so it gets rejected due to github policy
then it asks again to squash merge. I approve again, this time it works
no agent account or clickops on github needed. pure broker side fine-grained policies
you don't need to create policies manually either! just ask your agent to set them up while setting up brokerkit
"I want my claw to be able to read all my repos except X Y Z, and I want it to be able to push to main directly on A B C repos"
I have a separate control-plane repo outside the control of my claw. My local brokerkit policy gets synced there, policy as code
I like this model a lot! Complete control over what my agent can do with my own account. Version controlled, explicit. For free locally, without having to deploy a server to host the broker
(btw since I implemented brokerkit, I don't need reviews on this repo, and hence no need for admin merge. but I'm still keeping it to be able to dogfood it for other users)

@onusoz · /2026/07/22 · 02:52 AM

pi has /goal too. @micLivs Michaelliv/pi-goal implementation works great!

@onusoz · /2026/07/21 · 03:57 PM

Apparently mosh does not allow bitmaps and herdr complicates things even more So my @pidotdev nyan cat context indicator has to be full unicode 🏳️‍🌈🏳️‍🌈🏳️‍🌈😺

@onusoz · /2026/07/21 · 03:39 PM

My recent formulation is handy when it comes to claims like this, e.g. 50 tok/s on DGX Spark with DS4 Flash 🤨
Because if the formulation is correct, the laws of physics do not permit > 28 tok/s in a single session, for weights comparable to antirez/deepseek-v4-gguf. Ignoring speculative decoding (rho = 1)
Then I realized that the reported number is an aggregate of 4 sessions, so per single session it is like 12~14 tok/s (great result if it holds btw!)
IMO people subjectively always assume single session speeds, so when a tok/s is claimed, we should all converge on a clear notation
For example, 4x12.3 tok/s makes it very clear: 4 sessions, 12.3 tok/s each
And when we want to emphasize aggregate, let's always show them together: "4x12.3 tok/s = 50 tok/s aggregate"
Another interesting result of the formulation: If you are willing to suffer 12 tok/s, then you can theoretically have up to 8 parallel sessions --- around 100 tok/s aggregate, for the same size of weights and KV cache
You can find the specific hardware/model combo upper bounds in the Hugging Face space link below, and my formulation as well
Let me know if you find any errors in my upper bound formulation
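The proposed "sessions x per-session = aggregate" notation is simple enough to generate mechanically. A minimal sketch follows; the function name is made up for illustration, and the aggregate is printed unrounded (49.2 rather than the rounded 50 used in the post).

```python
def format_throughput(sessions: int, tok_s_per_session: float) -> str:
    """Render throughput as 'NxR tok/s = A tok/s aggregate'."""
    if sessions < 1 or tok_s_per_session < 0:
        raise ValueError("need at least one session and a non-negative rate")
    aggregate = sessions * tok_s_per_session
    # ':g' drops trailing zeros, so 12.0 prints as 12 and 12.3 stays 12.3.
    return f"{sessions}x{tok_s_per_session:g} tok/s = {aggregate:g} tok/s aggregate"


print(format_throughput(4, 12.3))  # the 4-session claim discussed above
print(format_throughput(8, 12))    # the 8-session scenario
```

Keeping both numbers in one string means a reader never has to guess whether a quoted tok/s figure is per session or aggregate.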

@Ex0byt · Jul 19, 2026

(Fable stands corrected, in shame) Sol + Kimi for the win! DeepSeek-V4-Flash: ~50 tok/s on 1 DGX Spark, full quality, no pruning.

@onusoz · /2026/07/21 · 05:34 AM

it’s not hard to have such long running tasks, they come up often in refactoring, dataset processing, etc. this is one of the longer ones. I think I used codex 4-5 resets in the last 3 weeks, token usage going vertical chatgpt pro is one of the highest value plans in the market

@onusoz · /2026/07/20 · 05:55 PM

Repo: github.com/osolmaz/pi-wor…

@onusoz · /2026/07/20 · 05:55 PM

Speaking of graphs... here is something I have wanted to build for 3 months, and finally had the chance to, thanks to @pidotdev
I often have these sequences of prompts that emerge while I work. Not just sequences but conditionals that necessitate control flow
For example, one workflow that resembles socratic questioning:
1. Discuss some problem
2. "What is the most elegant and long-term production ready solution for this?" -> Agent replies
3. "Is that the holy grail?"
4. Agent can reply "yes it is, basically" or "no, it is not, it is instead ..."
5. If yes, continue to "autoimplement". If no, think about it and decide what to do
And "autoimplement" is a single prompt of 6-7 sequential steps, which I've been meaning to make more deterministic as well
But I wasn't sure how to build it
I had previously built acpx workflows to be a swiss army knife, "something like n8n, but can drive codex through deterministic steps, nodes in a graph. or claude code. or pi. it uses acp..."
But it had one problem. It was run from outside the harness, like a CI orchestrator
I wanted to integrate acpx into pi. Because pi was the only CLI that could enable building of such a thing. But I wasn't sure how to reconcile a general ACP-based tool with a single coding agent
I was being too accommodating of all the other harnesses, claude code, codex. I was trying to be too general
I have changed my mind since then
ACP is great and lets you integrate a harness into other software in cool ways
But maybe, if a harness is proprietary, does not accept outside contributions, or does not even *support ACP*, maybe, it does not deserve cool features 😤 (they know who they are)
So I ripped out ACP, and built it natively, only for pi
No need for a web viewer... Just view it in a native widget, right inside pi!
I cannot put into words how awesome it is to be able to do this! I am still tinkering, discovering. It is at osolmaz/pi-workflows if you want to take a look
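The five-step socratic workflow above is essentially a tiny graph with one branch. Here is a rough sketch of that control flow in plain Python. It does not use the pi-workflows or acpx API: `Agent`, `socratic_workflow` and the step names are all hypothetical, and the agent is just a function from prompt to reply.

```python
from typing import Callable, List

# Hypothetical stand-in for a coding agent: takes a prompt, returns reply text.
Agent = Callable[[str], str]


def socratic_workflow(agent: Agent, problem: str) -> List[str]:
    """Run the prompt sequence and return the names of the steps taken."""
    steps = ["discuss"]
    agent(f"Let's discuss this problem: {problem}")

    steps.append("propose")
    agent("What is the most elegant and long-term production ready solution for this?")

    steps.append("holy_grail_check")
    verdict = agent("Is that the holy grail? Start your answer with yes or no.")

    # The conditional edge: only a clear "yes" continues automatically.
    if verdict.strip().lower().startswith("yes"):
        steps.append("autoimplement")
        agent("autoimplement")
    else:
        steps.append("human_decides")  # hand control back to the user
    return steps


def yes_agent(prompt: str) -> str:
    return "Yes, it is, basically"


def no_agent(prompt: str) -> str:
    return "No, it is not, it is instead ..."


print(socratic_workflow(yes_agent, "flaky tests"))
print(socratic_workflow(no_agent, "flaky tests"))
```

Because every prompt is sent in its own turn, the agent only sees the "holy grail" question after it has committed to a proposal, which is the point of sequencing prompts instead of packing them into one.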

@onusoz · /2026/07/20 · 01:53 PM

It's interesting, a @pidotdev extension can be built with different attitudes:
(a) To be installed as a package: You want other people to adopt it, like @nicopreme /pi-web-access
(a) is like "I want others to adopt it, so I will try to make it elegant and simple, and give it a good architecture"
People install these by npm install ... or pi install ... (or rather, by telling their agent to do it)
(b) To be vendored: You build for yourself. It is too idiosyncratic to you, and you know others will likely not adopt it. But you still put it out there, because it's easier to share when you mention it to a friend
(b) is like "This is mine, I don't care what others think about it. If they want to use it, they do whatever they want with it"
People install these by pointing their agent at it and telling it to copy it to their own extensions repo
And then there is the maintenance dimension:
- If an extension poses as (a),
- but has not been maintained for 3 months,
- and is the sort of package that has to be continuously maintained by its nature (not one-off),
then you deem it unmaintained, and still vendor it in...
Interestingly, most pi extensions I observe out in the wild appear to be of category (b), because creating (b) is easier than creating (a)
This is nothing new, and I am kind of late to the game. Moreover, pi is not the first software of this kind, emacs for example did it decades before
But code wasn't cheap then. So people still tried to coordinate effort
When you don't have to write any code to extend---a lisp dialect, typescript or otherwise, when extending is 1 prompt away, when you can pump out 10 extensions in an evening, then we enter a new dimension of malleability
Before, in the pre-AI era, most users of emacs were not producers of extensions but consumers of them, save for a few prolific authors
After, now, every user can be a producer!
This is significant! And a lot more messy! And a lot more fun!
"A love letter to pi" indeed!

@onusoz · /2026/07/20 · 12:56 PM

stealing this, thanks @tornikegomareli 😺 x.com/tornikegomarel…

@tornikegomareli · May 9, 2026

I’ve always loved nyan-mode in Emacs, so I built a pi-nyan-mode extension for @badlogicgames’s Pi. Instead of showing your position in the buffer, it moves left to right inside your footer as the active context fills.

@onusoz · /2026/07/20 · 12:02 PM

I have a confession to make. I am Pi-curious
I have spent the weekend getting the coding agent UX I have always wanted, with @pidotdev
Here is one example: my turn-fold pi extension in action
It brings the codex desktop app UX into the CLI, where it collapses all the messages between the user message and the last assistant message
Pi also lets me remove the pesky indentation so I don't have to do it manually every time I copy something. No need for an extension, just a config change
My pi config is open source under osolmaz/onurpi

@onusoz · /2026/07/19 · 05:28 AM

if you are below 30 and not old geezers like us, read about the history of gnu, linux and the pathetic struggle of microsoft trying to squash open source internet the flicker company is our day's microsoft, in its pathetic struggle to squash open source ai reading history will give you a potential map of what comes next

@qualiascript · Jul 18, 2026

it's 2001. open-source Linux is better than Windows on servers. Steve Ballmer calls Linux "cancer", "communist" and asks for it to be regulated away it's 2026. open-source LLMs are better than ChatGPT/Claude on costs. they are called "decelerationist" and "communist"

@onusoz · /2026/07/19 · 03:41 AM

Nothing accelerates progress as much as crony capitalist cartels and decelerates progress as much as open sharing of knowledge and science

@deanwball · Jul 17, 2026

Some observations on Kimi:
1. It's a very good model! I don't think its performance can be explained away by distillation or anything like that. In agentic coding sessions, it seems pretty much on par with the best public models of Q1 2026. In my fairly limited use, it also seemed very token hungry. It's not obvious to me that this model is actually that cheap to run.
2. I am personally surprised the Chinese state continues to allow the open sourcing of models this good, given potential risks. To be clear, I *myself* might be fine with models presenting this level of marginal risk being open weight, but I am surprised that China is fine with it. I suspect the reason they are is 75% explained by strategic blindness/lack of AGI-pilledness (the CCP is very Yann Lecun-y in its views of AI). The other 25% or so is their lack of compute for customer inference (making China's open-weight strategy an unintended byproduct of US export controls) and the normal Chinese strategy of aggressive exports. For the companies, as opposed to the government, the decision to open source is partially ideological and partially because they are behind, and they know that very few people would pay for sub-frontier models from China.
3. Open-weight models are inherently decelerationist, and I'm continually surprised to see the so-called "accelerationists" so excited about open-weight models. I suspect the reason they are is that they know open-weight models are effectively ungovernable, and they simply like the overall cloak of ungovernability open-weight models create over the whole of AI. It's not a bad strategy; it reminds me of James Scott's recounting of the hill people in "the art of not being governed." Still, in the end, open-weight models deter further AI capex.
4. One probable outcome of an open-weight-model-dominant world is full AI communism, which is precisely what China proposes: rather than a market product, AI is a "public good" which will ultimately be provided by the state as a kind of "digital public infrastructure." This future strikes me as a dystopian hellscape, but I've never met an open-weight models advocate who doesn't ultimately concede this is where things end. You'd be surprised how many 'accelerationists' lobbied me, while I was in government, to support an eleven or twelve-figure federally funded data center so that startups could train models at a subsidy and then give them away for free. There was no other way for AI to progress, they said. Perhaps this is the logical end state of things. Nonetheless, I find myself surprised to see supposed accelerationists excited about such an outcome. I think many of them just don't know what they're doing. Many accelerationists do not view the creation and serving of frontier models as a legitimate business.
5. I would guess that the Trump Administration will at some point realize that their best strategy here would be to create large amounts of regulatory risk around the use of open-weight Chinese models. You don't need to "ban open source" (one of the dumber motifs of AI policy discussion). You just need to direct every agency to issue soft law that creates FUD. "A Federal Reserve Advisory Bulletin found that there may be backdoors in Chinese AI models." It needn't be that well justified. You just create enough regulatory risk that every regulated enterprise backs off. You probably don't want to create so much regulatory risk that you scare off the hyperscalers from serving Chinese models; this will just drive startups to sketchier providers. There's a happy middle ground here. I'd assume they will do some version of this.
6. It's probably true that open-weight models of this capability make the world a bit more dangerous, but not so much more that you'll really notice. At some point the models will be capable enough that you will notice. "A nonliving, invisible, dangerous, and infinitely self-replicating agent escaped from a Chinese lab," you say? Color me shocked.

@onusoz · /2026/07/19 · 02:50 AM

Inkling by @thinkymachines scores higher than GPT-5.6 Luna/Terra on AlmanBench. Around the same level as DeepSeek V4 Flash

@onusoz · /2026/07/18 · 03:45 PM

I dared to post Alman to r/German, and it reminded me why I left Germany again Also, my post got removed, reddit being reddit reddit.com/r/German/comme…

@onusoz · /2026/07/18 · 02:06 PM

Introducing AlmanBench
A benchmark measuring how well an LLM can simplify German 🇩🇪, thereby simplifying German thinking 🤔, thereby saving the EU from bureaucracy and regulation 🇪🇺
GPT-5.5, GPT-5.6 Sol and Fable 5 are head to head. Interestingly, GPT-5.5 xhigh scores higher than 5.6 max. And I did not run Fable 5 maxxx thinking yet, that thing costs a ton. So the ranking will likely change
Other interesting things:
- Opus 4.8 max performs really badly, worse than Minimax M3
- DeepSeek v4 Flash performs better than Pro
- GPT-5.6 Luna and Terra score almost the same
The great thing about AlmanBench is that big labs will not bother to benchmaxx this. So you know it will remain a truthful scorer of reasoning capabilities for some time
And if labs *do* end up benchmaxxing AlmanBench, then that would mean Alman got in the weights and the EU won 😁

@onusoz · /2026/07/18 · 01:38 PM

trying to hammer in some character into GoePT. now it speaks like it grew up in the streets of berlin 💀 "goethe meets gpt meets street" also, that feel when AI hits you with it's not X it's Y in German though 😩

@onusoz · /2026/07/18 · 01:15 PM

I put my money where my mouth is
What would be the point of training a German simplifier model, if I spoke English with my agent??? that would be dishonest
here is GoePT being impressed by my dataset: "That is a serious dataset, 43k training pairs, reviewed by Fable, split leakage-safe, with full provenance"

@onusoz · /2026/07/18 · 12:59 PM

My ML Claw instance GoePT can read my private hugging face datasets using hf-broker (brokerkit) ML Claw connects to hf-broker through MCP. You can safely give read access behind the broker, and give write access only to the repos it needs to work with It cannot force push, unless you allow it to

@onusoz · /2026/07/18 · 09:00 AM

hard to not like the reset company these days

@thsottiaux · Jul 18, 2026

Oops... I did it again. Enjoy reset usage limits for all paid users for Codex and ChatGPT Work. Super grateful for an incredible team who is iterating at lightspeed and keeping the infra up as we scale faster than ever. Enjoy the weekend!

@onusoz · /2026/07/18 · 07:00 AM

so tired of LLMs pushing latin/greek GRE-cel words to me adjudication -> are you trying to be smart? just use "judging" provenance -> wow 🧠... why not just use "sourcing"??? judge and source already come from latin roots and are the everyday words we use big labs developing large LANGUAGE models have the biggest lever on language now if they wanted, they could finally put an end to the classist, anglosaxon-substitutionist insincerity that has plagued the english language for 500 years no english speaking country is ruled by a king now, the way it used to. there is no empire left to justify class language. these are vestigial remnants from the past maybe my next weekend project should be simplifying english...

@onusoz · /2026/07/18 · 06:12 AM

acpx lets you build graphs though I have to admit I am not using it at scale these days...

@onusoz · Mar 30, 2026

acpx v0.4 ships Agentic Workflows, or as I like to call them "Agentic Graphs"
It lets you create node-based workflows on top of ACP (Agent Client Protocol), to drive any coding agent (Codex, Claude Code, pi) through deterministic steps
This lets you automate routine, mechanical legwork like triaging incoming PRs, bugs in error reporting, and so on...
For example, OpenClaw receives 300~500 new PRs per day. A lot of them are low quality, but they still relate to real issues, so you have to address them somehow
You need to:
- extract the intent
- cluster them based on intent
- figure out if the proposed changes are legit, or whether they are slop local solutions, like trying to catch flies instead of drying out the swamp
- if the PR is too low quality or the intent is not clear, close it
- run AI review on them and address any issues that come up
- refactor them if the changes are half-baked
- resolve conflicts
- and so on...
So that when the PR is presented to the maintainer, all the routine legwork is done and the only remaining thing is the decision to (a) merge, (b) give feedback to the PR author, or (c) take over the PR work yourself
I have wanted to build this feature for a couple of months now, since Codex got so good. OpenAI models are now good at judging implementation quality, so I found myself repeating the same steps I wrote above over and over
I also tried putting all this in a single prompt. But I believe there are workflows that should not be a single prompt, but a sequence of prompts in the same session
That is because like humans, LLMs are prone to PRIMING. I claim that putting all steps in the same prompt at the beginning of the context will generally give suboptimal results, compared to revealing the intention to the model step by step
Creating such a workflow also gives more OBSERVABILITY into each step that an agent is supposed to take. The agent generates JSON at the end of each step, and that structured data can be used to monitor thousands of agents running at the same time more easily, on a dashboard
Similar features have been introduced in e.g. n8n, langflow. But AFAIK they are not integrating ACP the way I do
I wanted to have a fresh approach, and to build an API that I can develop freely the way I want, so I created a new workflow API inside acpx
The video is from the workflow run viewer, but that is not where you build the workflow. You build it by using the acpx flow typescript API. See examples/pr-triage in acpx repo
Before building that, I started from a Markdown file with a Mermaid chart of the flow I had in mind. The Markdown file acts as a spec for the flow, and I have built the workflow through trial and error. I call this process "workflow tuning"
I started working on acpx repo PRs one by one, tuning the flow, slowly scaling to more PRs. Finally, when I felt confident, I ran it in parallel over all external open PRs in the acpx repo. I believe it already saved me hours this week
My next goal, if well received, is to set this up on a cloud agent so that it can process the 300~500 PRs the OpenClaw repo receives every day, in real time, as they come in
I believe this will save all open source maintainers around the world countless hours and make it much easier to herd and absorb external contributions from everyone!

@onusoz · /2026/07/17 · 07:03 PM

Using the formulation, I calculated theoretical upper bounds for all the models on Hugging Face that have downloads over 100k, matched against a database of consumer GPUs. Here is an example of Qwen/Gemma NVFP4 quantizations and antirez/ds4 on the DGX Spark and some other 128 GB Mac configurations

@onusoz · /2026/07/17 · 07:03 PM

Request for Review
I spent a lot of time trying to profile local models, and am bothered by the fact that (to my knowledge) I don't have a mathematical framework to reason about the maximum token throughput I can expect from a model
I did some exploration based on a toy model of a GPU. My main goal is to find guaranteed upper bounds for throughput and other performance metrics, to act as a rule of thumb while trying to optimize inference
Like, if I wanted to serve at 50 tok/s, how many parallel sessions can I do with this GPU, running a specific model's specific quantization with a specific architecture?
Ideally, one should be able to plug in memory capacity and bandwidth of the GPU, and model specific parameters, like total model size, number of parameters or active parameters, etc. This is what I tried to do here and I think I have a good first try
I am not sure if I am reinventing the wheel, so please tell me if this formulation already exists somewhere. The goal is to find an upper bound in terms of memory, so it makes the assumption that inference is bottlenecked by memory read/write and ignores the case where computation is a bottleneck
I present 2 closed form upper bound formulas. The bigger one draws an absolute ceiling on token throughput from a GPU, which is:
(memory capacity * memory bandwidth) / (model weight size * kv cache size per session)
This is not practical and gives too high numbers because it ignores the overhead of KV operations, but it is a theoretical upper limit for the given numbers
And then I introduce architecture specific formulas which take into account KV operations, and give much closer results. But I will not go into the details of that here, please refer to the post for that
Speculative decoding speedup also appears as a simple multiplier rho
If you are profiling local models, I would appreciate if you took time to look at this formulation and review it!🙏 Please let me know in the replies if you find any issues, because there most certainly are some!
Post:
Using the formulation, I calculated theoretical upper bounds for all the models on Hugging Face that have downloads over 100k, matched against a database of consumer GPUs. You can find the link to that in the tweet below 👇
The video here is a visualization of the 2 upper bounds I introduce, showing the bits of memory that are on the critical path of one cycle of inference, as if they are on a single pipeline
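The absolute ceiling quoted above is easy to evaluate by hand or in code. One way to read it (an interpretation, not taken from the linked post): at most `capacity / kv` sessions fit in memory, and each decode step has to stream the weights once, so at most `bandwidth / weights` steps run per second. The sketch below just implements the stated ratio with `rho` as the speculative decoding multiplier. The function name and the example numbers are illustrative placeholders, not measurements of any real model or GPU.

```python
def aggregate_ceiling_tok_s(
    capacity_bytes: float,
    bandwidth_bytes_per_s: float,
    weight_bytes: float,
    kv_bytes_per_session: float,
    rho: float = 1.0,
) -> float:
    """Absolute memory-bound ceiling on aggregate tok/s across all sessions.

    (memory capacity * memory bandwidth) / (weight size * KV cache per session),
    scaled by rho (rho = 1 means no speculative decoding).
    """
    values = (capacity_bytes, bandwidth_bytes_per_s, weight_bytes,
              kv_bytes_per_session, rho)
    if min(values) <= 0:
        raise ValueError("all inputs must be positive")
    return rho * (capacity_bytes * bandwidth_bytes_per_s) / (
        weight_bytes * kv_bytes_per_session
    )


# Illustrative placeholder numbers only: 128 GB memory, 273 GB/s bandwidth,
# 60 GB of weights, 2 GB of KV cache per session.
ceiling = aggregate_ceiling_tok_s(128e9, 273e9, 60e9, 2e9)
print(f"{ceiling:.1f} tok/s aggregate ceiling")
```

As the post says, this number is deliberately loose: it ignores KV read/write overhead and compute limits, so it is only useful as a sanity check on claimed throughput.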

@onusoz · /2026/07/17 · 01:26 PM

isn't AI amazing? you can just prompt things: --- download youtube.com/shorts/S72ZRBS… with yt dlp there is a sentence with "laggard models" download the video and cut out that sentence, the part with that sentence then it should be like Now, what I do worry about with these "laggard models" laggard models laggard models laggard models (slowed down) then also download some arena ai or artifical analysis leaderboard pictures or news article headlines showing kimi k3 beating all the other models including fable in the 2nd 3rd and 4th, it should alternate, and after that, it should stop in the most striking one and then it should play super mario loss sound do that video edit now

@onusoz · /2026/07/17 · 01:24 PM

"Laggard Models"

@arena · Jul 16, 2026

Kimi-K3 just topped the Frontend Code Arena with a 76% pairwise win rate. When its output was compared head-to-head against other models on the same task, it was picked as the better output 76% of the time on average. For reference: Claude Fable 5 (63%), GPT-5.6 Sol (58%). 50% is baseline, a model winning and losing equally often.

@onusoz · /2026/07/17 · 05:06 AM

fable is not the oracle. fable is the genie an oracle just talks a genie brings things into existence

@onusoz · /2026/07/17 · 04:04 AM

all I do now is wish for things from a genie every day

@onusoz · /2026/07/16 · 06:06 PM

I deleted my agent accounts. I don't need them anymore for secure access to GitHub and Hugging Face repos
Instead, I am using brokerkit, a credential broker swiss army knife which can build an approval gate around any system
My agents automatically get read access to all my repos, unless I specify certain ones to be excluded
They then have to open PRs which only I can merge. Unless I allowlist them on certain repos to push to the main branch
Best feature: Timed requests. e.g. "I want to be able to push to the main branch for the next 2 hours", "I want to release this package only once in the next 5 minutes"
This saves me from the hassle of changing a config for some short term change, but also protects me from the woes of long term prompt-injection risk
My openclaw instance has its own linux user on my workstation, and is no longer root. It uses gh-broker and hf-broker from inside its account. It can also run sudo through sudo-broker, to install stuff. Secure access for free, without having to set up a separate server
It ingests quite a bit of information from the internet every day, and the remaining lethal trifecta risk is leaking of some of my private repos, and nothing else
Repo:
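The "timed request" idea boils down to a grant with an expiry and, optionally, a use count. The sketch below illustrates only that concept. It is not brokerkit's API or policy format; `TimedGrant`, its fields and the action strings are all assumed names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TimedGrant:
    """Hypothetical approval that expires and may be limited to N uses."""
    action: str
    expires_at: float                # seconds, on the same clock as `now`
    uses_left: Optional[int] = None  # None means unlimited until expiry

    def try_use(self, action: str, now: float) -> bool:
        """Return True and consume one use if the action is still allowed."""
        if action != self.action or now >= self.expires_at:
            return False
        if self.uses_left is not None:
            if self.uses_left <= 0:
                return False
            self.uses_left -= 1
        return True


# "push to the main branch for the next 2 hours"
push = TimedGrant("push:main", expires_at=2 * 60 * 60)
# "release this package only once in the next 5 minutes"
release = TimedGrant("release", expires_at=5 * 60, uses_left=1)

print(push.try_use("push:main", now=60.0))   # inside the window
print(release.try_use("release", now=10.0))  # first release
print(release.try_use("release", now=20.0))  # second attempt is refused
```

Because the grant expires on its own, nothing needs to be reverted afterwards, which is what limits the window for a prompt-injected agent.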

@onusoz · /2026/07/16 · 05:32 PM

I deleted my agent accounts. I don't need them anymore for secure access to GitHub and Hugging Face Instead, I am using

@onusoz · /2026/07/16 · 05:18 PM

Your ML agent on Hugging Face infra: ML Claw 🦞🤗 Quickstart: npx mlclaw bootstrap If you have Hugging Face PRO, you can run your agent on a Space, with the highest possible download speed to Hugging Face models, datasets and buckets, for fast experimentation 🚀

@onusoz · /2026/07/16 · 05:18 PM

When I started building this, the docker space was actually free as well. But due to some bad actors exploiting the free tier at a mindblowing scale, free docker space had to be retracted from the free tier ☹️ However, you can still run ML Claw locally, but with your state backed up to a Hugging Face bucket for free, up to 100 GB of free private storage and 8 TB of public storage! Repo if you would like your agent to scan it before you run the command: github.com/osolmaz/mlclaw

@onusoz · /2026/07/16 · 01:51 PM

Codex and Claude code teams should learn something from Cursor team when it comes to queueing/steering

@onusoz · /2026/07/16 · 04:25 AM

I asked ML Claw to name itself. Of course in German/Alman Lame suggestions: Ulf, Strich, Fritz, Lex I intervened and called it GoePT Goethe rolling in his grave

@onusoz · Jul 15, 2026

The EU 🇪🇺 is broken. It is drowning in regulation The reason for that is simple Germany 🇩🇪 is the dominant economy of the EU And German 🇩🇪 is the most rule-ridden language in the world Coincidence? I think not. The language people speak determines their thinking and their fate. The Sapir-Whorf hypothesis holds The solution is simple: Fix German, Save the EU 💪 I will SAVE EUROPE by simplifying German. With the help of AI. Once and for all 🫡 I will use OpenClaw 🦞 running on Hugging Face 🤗 to train a model that simplifies German The agent harness is called ML Claw. It has full access to GPUs, jobs, sandboxes and datasets on Hugging Face: github.com/osolmaz/mlclaw My new German dialect is called Alman: alman.ai I will be live-tweeting my Quixotic adventure as it unfolds, in this thread My goal is to show that you can run autoresearch loops on Hugging Face infra on free private Spaces, while getting access to SOTA open models like GLM 5.2. And even use your own Codex subscription! And it does not have to be ML. You can just use free Hugging Face Spaces for your OpenClaw agents. It's free compute! Bookmark to be able to find this thread later on 👇

Image hidden

@onusoz · /2026/07/15 · 06:36 PM

Fable just seems to know what you want, and it is an incredibly empathetic model. AGI is definitely here. This model can do anything the average human does and more, given the right context. I don't like Anthropic's marketing team, but you've got to hand it to them. They got there before OpenAI did, despite not having a head start. Now, I can't wait to be able to run a model of this caliber locally 🚀

@onusoz · /2026/07/15 · 06:32 PM

Fable one-shotted the Institut seal of approval. I did not touch it. The logo idea came from me though. I generated it in ChatGPT and vectorized it in Inkscape. But Fable still wrote the prompt for that

Image hidden

@onusoz · /2026/07/15 · 06:31 PM

It actually one-shotted the checkbox showing the diff between German and Alman. I have wanted to do exactly this for 2 years, and it did it without me even asking for it in this session, when I only asked for a German/Alman translation

Image hidden

@onusoz · /2026/07/15 · 06:27 PM

Fable one-shotted the scrolly thing that demonstrates the idea. More like multiple messages back and forth where it does exactly what I picture in my head, and even better

@onusoz · /2026/07/15 · 06:17 PM

I have been working on this for 4 years. Every time, I shelved it, because the models were not good enough. I tried it with Claude 2. I tried it with Gemini 2.5 Pro. I tried it with Claude Opus 4. None of them were good enough. Until Fable came along. Fable achieved a perfect score on my hand-curated benchmark for Alman. It also one-shotted the landing page, features and translations you see on alman.ai. It is busy creating the golden dataset for training right now. I am also happy to announce AlmanBench, a benchmark that measures how well models can translate from German to Alman. I will be posting about performance when a new model drops. This idea should have died with me. But thanks to AI, I can unleash my madness onto the world 😈

@onusoz · /2026/07/15 · 05:52 PM

The EU 🇪🇺 is broken. It is drowning in regulation. The reason for that is simple: Germany 🇩🇪 is the dominant economy of the EU, and German 🇩🇪 is the most rule-ridden language in the world. Coincidence? I think not. The language people speak determines their thinking and their fate. The Sapir-Whorf hypothesis holds. The solution is simple: Fix German, Save the EU 💪 I will SAVE EUROPE by simplifying German. With the help of AI. Once and for all 🫡 I will use OpenClaw 🦞 running on Hugging Face 🤗 to train a model that simplifies German. The agent harness is called ML Claw. It has full access to GPUs, jobs, sandboxes and datasets on Hugging Face: github.com/osolmaz/mlclaw My new German dialect is called Alman: alman.ai I will be live-tweeting my Quixotic adventure as it unfolds, in this thread. My goal is to show that you can run autoresearch loops on Hugging Face infra on free private Spaces, while getting access to SOTA open models like GLM 5.2. And even use your own Codex subscription! And it does not have to be ML. You can just use free Hugging Face Spaces for your OpenClaw agents. It's free compute! Bookmark to be able to find this thread later on 👇

Image hidden

@onusoz · /2026/07/15 · 12:13 PM

My friend Poli at @huggingface is literally the 1st to find out about cool new model drops in AI (he has his setup). He is the definition of *alpha* in AI. When there is a new model drop, his agents autonomously create a demo around it, for better visibility and interactivity (with human curation of course). And now he has created the @HuggingApps account to share these demos live, here on Twitter. So if you follow this account, you will be getting the state of the art directly from the publisher. Not weeks, but hours after! Give @HuggingApps and @multimodalart a follow!

@multimodalart · Jul 15, 2026

Kicking off @HuggingApps - a curated & high signal set of novel, cool & interesting demos and apps that you can play right now, for free, on Spaces. Check it out 🤗

@onusoz · /2026/07/15 · 11:42 AM

@TheAhmadOsman said it best: opensourceaimustwin.com

@onusoz · /2026/07/15 · 10:27 AM

A lot of criticism is coming towards local models, and the money people are spending to run them. Some of these criticisms are valid. No, the layperson will not buy a DGX Station, nor spend $50k on a rig. They will spend max $3k on a computer, and if their work really necessitates it, up to $10k. But in some of these criticisms, I see a lack of first-principles thinking. Local models might be shittier now compared to the ones you buy from the cloud. The question is, do you WANT to LIVE in a world where you don't have a choice but to RENT your intelligence? From a duopoly that can extort you for the last cent in your wallet, in the long run? I personally do not like that idea. So open source AI HAS to win and keep on winning, continuously and permanently. There is no other choice. There is also a preferential belief that AGI/ASI will be reached, but then we will NOT be able to use that to make local models work efficiently??? Like, can you believe the rate of optimization and compression that has been happening to models in the last 3 months? Would you have believed that an o4-mini level model could run on your phone and o4 on your desktop, if we had told you 1 year ago? When we are finally there, we will probably look back at this moment and think, "wow, we had not even started to see the real gains from optimization". Local and open source AI HAS to win, and it is up to US. So STOP complaining and get to work

@onusoz · /2026/07/15 · 07:07 AM

Fable 5 replaces GPT 5 Pro as the Oracle. And it is the go-to model for writing docs. So I now have a $ use-fable skill to use whenever I write a README, for example, inside codex, through acpx. Saves me from having to switch to another CLI: github.com/osolmaz/tools/…

Image hidden

Onur Solmaz · Post · /2026/07/15

How to get scraped

Note: this post is AI-assisted. It was written with Claude Fable 5 through Cursor, in the same working session that implemented everything it describes.

The internet is busy building walls against AI crawlers. Cloudflare blocks them by default now, publishers are suing, and every other blog post on the topic explains how to keep the crawlers out. This post is the opposite. I spent a day optimizing this site so that AI labs can scrape it as easily as possible, and this is the complete instruction set, so you can do the same to your own site in an afternoon.

My reasoning is that I will die and the weights might not.

Agent-famous

Every frontier model has read a compressed version of the public internet. When your writing is in the training corpus, models absorb your ideas and a faint imprint of how you think. When it is excluded, you don’t exist to them, and “them” increasingly means the layer through which other people experience the internet.

I think being agent-famous is becoming as important for a career as being human-famous. Agent-famous means the models know who you are without having to search. Ask a model about finite element exterior calculus or about keeping AI agents from littering your repo with Markdown files, and if it volunteers your name unprompted, you are in the canon. Search results get you cited when someone happens to look; the weights get you consulted by default. People already pick libraries, tools, and even consultants by asking an agent first, so the difference is starting to pay rent.

Whether your site actually makes it into a training set is decided by the labs, and no amount of optimization changes that. What you control is whether there is any excuse to skip you. A crawler that hits a JavaScript wall, an ambiguous license, or a Cloudflare challenge page will move on, and nobody at the lab will ever know what was behind it. The whole game is removing those excuses.

The view without JavaScript

Start by looking at your site the way a crawler does, which means without JavaScript. I did this last week during a platform migration and found something embarrassing. Every equation on this blog going back to 2017 was rendered client-side by MathJax, so a human with a browser saw beautiful math while a crawler saw raw TeX soup, or nothing. My most substantial technical writing had been invisible to every scraper for eight years.

The fix was moving math rendering to build time (KaTeX in my case), so equations ship as static HTML. The general rule covers more than math. Anything that only exists after JavaScript runs, whether charts, tabs, or content behind a “read more”, does not exist for most of the pipelines that assemble training data. Static HTML is the baseline.
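A quick way to take that crawler's-eye look is to strip everything a script would have produced and read what is left. The following is a minimal sketch using only the Python standard library, not the tooling used for this site; `VisibleText` and `text_without_js` are illustrative names, and real extractors do considerably more.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many skipped elements we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def text_without_js(html: str) -> str:
    """Return the text a non-JavaScript reader would see, one chunk per line."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# A page whose math is only rendered client-side: the raw TeX is all that remains.
page = '<p>Energy: \\(E = mc^2\\)</p><script>renderMath()</script>'
print(text_without_js(page))
```

If the output of something like this is TeX soup or an empty string, that is roughly what a scraper gets too.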

The one-command download

Extractors like trafilatura are decent at pulling article text out of HTML, but why make the labs work for it? Serve your content as plain text yourself:

Every page on this site now has a raw markdown mirror. Append .md to any URL, like /about.md.
/llms.txt is a markdown index of all 773 documents with titles and dates, following the llmstxt.org convention.
/llms-full.txt is the entire site in one plain-text file, about a megabyte. Each document carries a small frontmatter block with its title, date, canonical URL, and license, and documents are separated by a delimiter line.

So the whole blog is one command:

curl https://solmaz.io/llms-full.txt

If your site is built with a static site generator, all of this is a few small templates over content you already have. Mine is generated straight from the same markdown files the HTML pages come from, so it can never drift out of sync.
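As a rough illustration of how small those templates can be, here is a sketch in Python. It is not the generator used for this site: the `Doc` type, the frontmatter field names, the delimiter string, and the simplified index format are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    title: str
    date: str   # ISO date string, e.g. "2026-07-15"
    url: str    # canonical URL of the HTML page
    body: str   # markdown source


LICENSE = "CC BY 4.0"
DELIMITER = "\n\n=====\n\n"  # assumed separator; any unambiguous line works


def render_doc(doc: Doc) -> str:
    """One document: a small frontmatter block followed by the markdown body."""
    frontmatter = "\n".join([
        "---",
        f"title: {doc.title}",
        f"date: {doc.date}",
        f"url: {doc.url}",
        f"license: {LICENSE}",
        "---",
    ])
    return f"{frontmatter}\n\n{doc.body.strip()}"


def build_llms_full(docs: list[Doc]) -> str:
    """The whole site as one plain-text file."""
    return DELIMITER.join(render_doc(d) for d in docs) + "\n"


def build_llms_index(docs: list[Doc]) -> str:
    """A bare-bones index: one markdown link per document, pointing at its .md mirror."""
    lines = [f"- [{d.title}]({d.url.rstrip('/')}.md) ({d.date})" for d in docs]
    return "\n".join(lines) + "\n"


docs = [Doc("About", "2026-07-15", "https://example.com/about", "Hello.\n")]
print(build_llms_full(docs))
print(build_llms_index(docs))
```

Because both files are built from the same `Doc` list as the HTML, there is no second source of truth to keep in sync.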

One design decision worth copying. My X posts are archived on this site, and some of them quote other people. The quoted text stays out of every machine-readable endpoint, replaced by a link and an attribution line, because other people’s writing is not mine to hand out. A dump that respects provenance is also easier for a cautious lab to accept.

robots.txt as a welcome mat

Most robots.txt files are lists of rejections. Mine now does two other jobs.

First, it explicitly allows the AI crawlers by name: GPTBot, CCBot, ClaudeBot, Google-Extended, Applebot-Extended, Meta’s agents, PerplexityBot, Bytespider, and the rest, 23 stanzas in total. “Explicitly welcomed” is a stronger signal than “not mentioned”, and some pipelines are conservative about the difference.

Second, it starts with a comment block written for whoever, or whatever, is reading:

# Machine-readable content for LLMs and agents:
#   https://solmaz.io/llms.txt        index of markdown mirrors of every page
#   https://solmaz.io/llms-full.txt   the entire site as one plain-text file
# Every page is also available as raw markdown by appending .md to its URL.

robots.txt is the first file every crawler fetches. It might as well contain directions.
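A file like this is easy to generate and to sanity-check. The sketch below is illustrative rather than the actual file: it uses `example.com`, lists only some of the crawler names mentioned above, and checks the result with Python's standard-library `urllib.robotparser`, which real crawlers do not necessarily use.

```python
from urllib.robotparser import RobotFileParser

# A few of the crawler names listed above; extend the tuple with the rest.
AI_CRAWLERS = ("GPTBot", "CCBot", "ClaudeBot", "Google-Extended",
               "Applebot-Extended", "PerplexityBot", "Bytespider")

HEADER = """\
# Machine-readable content for LLMs and agents:
#   https://example.com/llms.txt        index of markdown mirrors of every page
#   https://example.com/llms-full.txt   the entire site as one plain-text file
"""


def build_robots_txt(crawlers=AI_CRAWLERS) -> str:
    """One explicit Allow stanza per named crawler, then a catch-all stanza."""
    stanzas = [f"User-agent: {name}\nAllow: /\n" for name in crawlers]
    stanzas.append("User-agent: *\nAllow: /\n")
    return HEADER + "\n" + "\n".join(stanzas)


robots = build_robots_txt()

# Sanity check: every named crawler is allowed to fetch the full dump.
parser = RobotFileParser()
parser.parse(robots.splitlines())
for name in AI_CRAWLERS:
    assert parser.can_fetch(name, "https://example.com/llms-full.txt")

print(robots)
```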

The license

This one surprised me the most. The change with the biggest payoff was legal, and it took ten minutes.

The most careful training corpora, like Common Pile, only include text with an explicit permissive license, which disqualifies roughly the entire web, since default copyright applies to everything that doesn’t say otherwise. Your beautifully written, perfectly scrapeable blog is radioactive to them unless you say the words.

So say the words. Everything on this site is now CC BY 4.0, declared in five places: a /license page, the footer of every page, a rel="license" tag in every page head, the frontmatter of every markdown mirror, and the header of llms-full.txt. Anyone may share, adapt, and train on my writing, as long as they say it came from me. For an immortality project, attribution is the entire point, so this trade costs me nothing.

Think about the terms before copying this step. CC BY means humans can republish your writing commercially too, and once granted, the license is irrevocable for existing copies.

Cloudflare settings

If your site sits behind Cloudflare, everything above may be silently irrelevant, because since mid-2025 Cloudflare blocks AI crawlers by default for new zones. It is a fine default for people who feel scraped rather than read, and exactly wrong for what this post is trying to do. In the dashboard:

Set AI Crawl Control to allow all AI crawlers, for both training and search.
Make sure AI Labyrinth is off. It feeds crawlers procedurally generated garbage pages, which is a fun idea and the exact opposite of this project.
Don’t enroll in Pay Per Crawl.
If Bot Fight Mode is on, confirm it isn’t challenging verified bots.

You cannot fully test this from outside. curl with a spoofed GPTBot user agent returning 200 is a good sign, but Cloudflare verifies real crawlers by IP range, so the dashboard is the only ground truth. Cloudflare’s crawler analytics will also show you which crawlers visit and whether any got blocked, which turns the whole exercise from faith into measurement.
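For the outside-in half of that check, here is a minimal sketch of the spoofed request in Python instead of curl. The User-Agent value is a simplified stand-in that merely contains the GPTBot token, not the exact string the real crawler sends, and `status_as_crawler` is an illustrative helper name.

```python
import urllib.error
import urllib.request

# Simplified stand-in; consult the crawler's documentation for its real User-Agent.
SPOOFED_UA = "Mozilla/5.0 (compatible; GPTBot/1.0)"


def build_request(url: str, user_agent: str = SPOOFED_UA) -> urllib.request.Request:
    """A GET request that identifies itself with the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def status_as_crawler(url: str, user_agent: str = SPOOFED_UA) -> int:
    """HTTP status returned to this User-Agent, e.g. 200, or 403 for a block."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent), timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Challenge and block pages typically arrive as 4xx/5xx responses.
        return err.code
```

Calling `status_as_crawler("https://example.com/llms.txt")` against your own domain gives the same weak signal as the curl test: a 200 is encouraging, but it proves nothing about how verified crawler IPs are treated.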

Discoverability

Common Crawl, which feeds most open training datasets, picks what to crawl largely by how well-linked a page is. You can be perfectly scrapeable and simply never get visited. The remaining work is old-fashioned SEO with a new customer:

The sitemap lists the markdown mirrors and both llms files alongside the HTML pages.
Register the site with Google Search Console and Bing Webmaster Tools. Several dataset pipelines bootstrap from search-engine URL lists.
Run your important pages through the Internet Archive’s Save Page Now. Wayback-derived corpora exist, and archive.org is the most patient crawler there is.
Inbound links remain the main lever: your GitHub profile, your repos’ READMEs, the occasional Hacker News thread. A Wikipedia citation, where genuinely warranted, outweighs everything else on this list.

To check whether any of it worked, look yourself up in the Common Crawl index. It tells you exactly which of your URLs made it into which monthly crawl.
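The lookup can be scripted against the Common Crawl index server. This is a hedged sketch: the crawl ID is a parameter you have to supply (the value in the usage note is only an example of the naming pattern), error handling is left out, and `index_query_url` and `captured_urls` are illustrative names.

```python
import json
import urllib.parse
import urllib.request

INDEX_HOST = "https://index.commoncrawl.org"


def index_query_url(crawl_id: str, domain: str) -> str:
    """Index query for every captured URL under `domain` in one crawl."""
    params = urllib.parse.urlencode({"url": f"{domain}/*", "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{params}"


def captured_urls(crawl_id: str, domain: str) -> list[str]:
    """Fetch the index and return the captured URLs (one JSON object per line)."""
    with urllib.request.urlopen(index_query_url(crawl_id, domain), timeout=30) as resp:
        lines = resp.read().decode("utf-8").splitlines()
    return [json.loads(line)["url"] for line in lines if line.strip()]
```

For example, `captured_urls("CC-MAIN-2024-33", "example.com")` would list what one monthly crawl holds for that domain; repeating it across crawl IDs shows whether coverage is growing.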

Write something worth stealing

None of the above matters if the content isn’t worth the disk space. Modern pipelines run quality classifiers over everything they ingest, and text that reads like filler gets filtered out long before a model ever sees it. You cannot plumb your way past that, and you shouldn’t want to. The point of the exercise is to preserve thinking, so there has to be thinking.

But I keep meeting the opposite failure. People who write genuinely valuable things, deep technical explanations, hard-won practical knowledge, put them on the open web and then never spend the one afternoon it takes to make them legible to machines. Their equations render in JavaScript, their license is the default all-rights-reserved silence, and their CDN quietly serves challenge pages to every crawler. They did the hard part and skipped the easy part.

The labs still make the final call, and there is something appropriately humbling about that. You do everything right and then wait, like an author with a manuscript in the mail, except the publisher is a filtering pipeline and the acceptance letter is a model that finishes your sentences. I have done my part. See you in the weights.

@onusoz · /2026/07/14 · 05:00 PM

People report Codex deleting their home folder or production database? 🫪 Hasn't happened to me. But before someone reports their GitHub or Hugging Face org being deleted: this is why you don't give your agent tokens with force-push or admin access. Here is how to protect your Hugging Face account: (P.S. my local credential broker is almost finished and it works great on github, hf and sudo commands. Complete lockdown against agent deletion risk, without being bogged down with PRs, too many approval requests or configuration. Will launch here in a few days)

@onusoz · Jul 5, 2026

Worried that giving your 🦞 @openclaw agent write access to your 🤗 @huggingface account can risk deletion of your datasets/models/spaces/buckets, or cause irreversible damage? 😱😱😱 No need to be! I have created an agent login helper to run in your YOLO mode remote machine, which prevents any risk of irreversible deletion. Just run: uvx hf-auth-helper agent login There is a specific set of scopes you can choose while creating a fine-grained HF token. These include all read scopes + discussion.write, which lets your agent create PRs. Since you are not giving repo.write, your agent cannot force-push your main branch, change repo settings or delete them. It is unfortunately not super straightforward to choose those on the web UI. This will hopefully change soon, and this functionality might even be natively in hf cli. Until that happens, use hf-auth-helper to log in worry-free in your remote or local openclaw instance. Your agents will be able to create PRs on datasets/models/spaces, which you will then be able to merge in your own browser. 2 caveats: 1) This does not solve the data exfiltration attack vector; nothing does. Make sure to exclude any repos which absolutely must remain private while choosing your scopes. See for more info: 2) As buckets are not repos, your agent will not be able to modify a bucket (add/remove data). To help with that, I have another project on the way, a credential broker. Stay tuned, coming soon. Source: Demo authentication flow:

@onusoz · /2026/07/14 · 11:38 AM

before fable is gone again... can't believe you can do this with a single prompt now

Image hidden

@onusoz · /2026/07/14 · 05:55 AM

Also, why gate the weights????

Image hidden

@onusoz · /2026/07/14 · 05:55 AM

Germany's new Soofi model's website shows me a cookie banner even though I am in Singapore 💀

Image hidden

@onusoz · /2026/07/14 · 12:50 AM

No need to be offended, I'm actually a fan of your work! The metrics might have been wrong or just misfired. Looking at your recent posts, they don't have the smell Curious, as an example, was this a model, or written manually? I have the corpus here, and only some of them have the smell to me x.com/i/status/20559…

@sudoingX · May 17, 2026

anyone thinking about, learning, or already working with agentic systems, you should know this. the first few steps of your setup matter more than any model or framework you pick later. get them right and you never lose your flow. the foundation nobody posts about: > 1. tailscale. a private mesh network across every machine you own. laptop, desktop, rented node, all on one secure tailnet, reachable from anywhere. nothing else works well until this does. > 2. termius, over that tailnet. one SSH client that reaches every node, phone included. you are never away from your stack. > 3. tmux. persistent sessions. disconnect, close the laptop, come back, every session exactly where you left it. agentic work runs long, your terminal has to survive that. > 4. a private git repo. the one i am most glad i found. it is the memory layer across all my agents, they pull, they work, they merge back, the codebase stays alive between sessions. context that would die in a chat window lives in the repo instead. > 5. script everything from day one. ssh aliases for every node, setup scripts, the boring boilerplate automated. if you will do a thing more than twice, it is a script. everything past these five is decorative. know these cold. and the habit that ties it together: ask the AI itself. for the config, for the error, for any of it, let the agent do the lifting, then double check what it hands you. lock the five, build the habit, and you make it. skip it, anon, and you ngmi.

Onur Solmaz · Post · /2026/07/14

Write-up of Reiner Pope's Lecture: How GPT, Claude, and Gemini Are Actually Trained and Served

Note: this post is an AI-assisted write-up of the blackboard lecture Reiner Pope gave on Dwarkesh Patel’s podcast. Watch the original video: How GPT, Claude, and Gemini are actually trained and served (YouTube, 2h13m).¹

Pope is the CEO of the chip startup MatX and previously worked on TPU architecture at Google. With two rules of thumb (a roofline model of a GPU rack, and “set competing costs equal to each other”), he derives why batching makes tokens up to 1000x cheaper, why frontier models may be over-trained ~100x beyond Chinchilla-optimal, and how much of a lab’s serving stack you can reverse-engineer from its public API prices. The figures below are redrawn from the blackboard.

I have tried to stay faithful to the original throughout, converting the dialogue into prose and keeping all the numbers as stated. Any errors introduced in the conversion are mine.

The question that motivates everything

Dwarkesh opens with a pricing puzzle. Companies like Anthropic, OpenAI, and Cursor offer a “fast mode” that streams tokens at roughly 2.5x the speed for 6x the price. What is mechanically going on that makes this trade possible? Could you pay 100x more and go even faster? And could there be a “slow mode” where you wait minutes and pay much less?

Pope’s answer is that the dominant effect is batch size, and the rest of the lecture quantifies exactly what batching does to latency and cost. (A second effect, speculative decoding / multi-token prediction, is set aside.)

The whole analysis rests on two simplifications:

A roofline model of the hardware. For a cluster like an NVIDIA Blackwell NVL72 rack (72 GPUs), only two numbers matter: memory bandwidth and compute throughput (FLOPs).
Two numbers for the model. The time to operate on the weights, and the time to operate on the context (the KV cache).

The KV cache is the per-conversation state the model keeps in memory. During decode, each new token runs a full forward pass through all the weight matrices, and its attention mechanism looks back at an internal representation of every previous token. That stored representation is the KV cache, and reading it is dominated by memory fetches rather than matrix multiplies.

The two-line roofline

The time for one decode step is bounded below by whichever is slower, the memory system or the compute:

t \geq \max(t_{\mathrm{mem}},\; t_{\mathrm{compute}})

The compute side has to multiply a batch of $B$ tokens by all the active parameters:

t_{\mathrm{compute}} = \frac{B \cdot N_{\mathrm{active}}}{\mathrm{FLOPs}}

(The attention compute is ignored; it is small in comparison.) Note the distinction between active and total parameters: in a mixture-of-experts model like DeepSeek V3, about 37B parameters are active per token out of roughly 700B total.

The memory side has to fetch all the weights once per step, plus the KV cache of every sequence in the batch:

t_{\mathrm{mem}} \geq \frac{N_{\mathrm{total}} + B \cdot \mathrm{len}_{\mathrm{ctx}} \cdot \mathrm{bytes}_{\mathrm{tok}}}{\text{memory bytes/s}}

These two lines are enough to draw the latency picture:

Decode latency vs batch size

The weight fetch is a constant floor: no matter how small the batch, you must stream all total parameters from HBM into the chips once per token, and if you use all your memory bandwidth you cannot beat that. This is the latency lower bound, and it already answers the fast-mode question: for a given hardware configuration there is a floor on how fast tokens can come out, and paying more only helps until you hit it.
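The two roofline lines translate directly into a few lines of Python. This is a sketch under stated assumptions rather than anything from the lecture: the model sizes are the DeepSeek figures quoted above, but the hardware throughput, context length, and KV bytes per token are invented placeholders, and `bytes_per_param=0.5` assumes FP4 weights.

```python
def decode_step_time(batch, n_active, n_total, ctx_len, kv_bytes_per_token,
                     mults_per_s, mem_bytes_per_s, bytes_per_param=0.5):
    """Return (t, t_mem, t_compute) in seconds for one decode step."""
    # Compute: every token in the batch multiplies against the active parameters.
    t_compute = batch * n_active / mults_per_s
    # Memory: all weights are read once, plus the KV cache of every sequence.
    weight_bytes = n_total * bytes_per_param
    kv_bytes = batch * ctx_len * kv_bytes_per_token
    t_mem = (weight_bytes + kv_bytes) / mem_bytes_per_s
    return max(t_mem, t_compute), t_mem, t_compute


# Placeholder hardware and KV numbers, chosen only to show the shape of the curve.
for batch in (1, 64, 1024, 4096):
    t, t_mem, t_compute = decode_step_time(
        batch, n_active=37e9, n_total=700e9, ctx_len=4096,
        kv_bytes_per_token=2_000, mults_per_s=4.8e15, mem_bytes_per_s=8e12)
    bound = "memory" if t_mem >= t_compute else "compute"
    print(f"B={batch:5d}  t={t * 1e3:7.2f} ms  ({bound}-bound)")
```

At small batches the printed time barely moves, which is the constant weight-fetch floor described above.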

Cost is a different plot. Renting the GPUs for one step costs the same regardless of batch size, but the step produces $B$ tokens, so the cost per token is $t/B$:

Cost per token vs batch size

At batch size 1 the weight fetches are not amortized over anything and the economics are up to a thousand times worse. As the batch grows, the weight-fetch hyperbola vanishes and the compute term becomes a hard cost floor. This also answers the “slow mode” question: a hypothetical Claude Code Slow would live on that floor, and it would not be much cheaper than normal serving, because the compute and the KV fetches are unique to each request and cannot be amortized further.
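To see where the hyperbola and the floor come from, divide both roofline terms by $B$. This is just a rearrangement of the two equations above, with the same symbols:

\frac{t}{B} \geq \max\left(\frac{N_{\mathrm{total}}}{B \cdot \text{memory bytes/s}} + \frac{\mathrm{len}_{\mathrm{ctx}} \cdot \mathrm{bytes}_{\mathrm{tok}}}{\text{memory bytes/s}},\; \frac{N_{\mathrm{active}}}{\mathrm{FLOPs}}\right)

Only the first term shrinks as $B$ grows. The KV fetch and the compute term are per-token constants, which is why no amount of extra batching pushes the price below them.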

The magic batch size

Where is the balance point where memory time equals compute time? Ignoring the KV term for a clean answer and equating the weight fetch with the weight multiply:

\frac{N_{\mathrm{total}}}{\text{mem BW}} = \frac{B \cdot N_{\mathrm{active}}}{\mathrm{FLOPs}} \quad\Longrightarrow\quad B = \frac{\mathrm{FLOPs}}{\text{mem BW}} \cdot \frac{N_{\mathrm{total}}}{N_{\mathrm{active}}}

The first factor is purely a hardware constant. Counted in FP4 multiplies (half a byte each), it comes out around 300 on most GPUs, and it has stayed roughly stable from A100 to H100 to B100 because FLOPs and memory bandwidth grew together. The second factor is the sparsity of the model. So:

B \gtrsim 300 \times \mathrm{sparsity}

For DeepSeek, which activates 32 of 256 experts (sparsity 8), that gives a batch of about 2,400 sequences. In practice people run double or triple that, since real-world efficiency is worse than the roofline. Including the KV fetch would push the optimal batch higher still. Remarkably, this result depends only on sparsity, never on model scale.
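The arithmetic is small enough to check in a couple of lines. A minimal sketch: `magic_batch_size` is an illustrative name, and the expert counts stand in for the total-to-active parameter ratio, as in the example above.

```python
def magic_batch_size(hw_intensity: float, n_total: float, n_active: float) -> float:
    """Batch size at which weight-fetch time equals weight-multiply time."""
    sparsity = n_total / n_active
    return hw_intensity * sparsity


# The lecture's example: hardware ratio ~300, 32 of 256 experts active.
print(magic_batch_size(hw_intensity=300, n_total=256, n_active=32))  # 2400.0
```

Note that the function never sees the absolute model size, only the ratio, which is the scale-independence the text points out.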

Trains departing every 20 milliseconds

How does a batch fill up with real users? Pope’s model is a train schedule. The server starts a new batch every ~20 ms whether or not it is full: any requests that are ready board the train, and a request that arrives just after departure waits for the next one. Worst-case queueing latency is therefore about 40 ms.

The 20 ms itself comes from a separate design principle: you want to read your entire HBM capacity once per forward pass, so the natural step time is capacity divided by bandwidth. On the Rubin generation that is 288 GB / 20 TB/s ≈ 15 ms, and the number has hovered around 20 ms across many HBM generations. There is no point going slower, because reading the read-only weights or the KV cache twice per token does nothing for you.

A batch of ~2,000 at ~64 steps per second is ~128,000 tokens per second per rack. Google has bragged about Gemini traffic in the hundreds of millions of tokens per second worldwide, so one rack’s economical batch is about one-thousandth of Gemini. That is the economy of scale in inference: real, but reachable by any serious provider.
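Both back-of-the-envelope numbers can be reproduced directly. A minimal sketch using only the figures quoted above; the function names are illustrative.

```python
def natural_step_time_s(hbm_capacity_bytes: float, hbm_bandwidth_bytes_per_s: float) -> float:
    """Time to read the whole HBM once: capacity divided by bandwidth."""
    return hbm_capacity_bytes / hbm_bandwidth_bytes_per_s


def rack_tokens_per_s(batch: int, steps_per_s: int) -> int:
    """Each step emits one token per sequence in the batch."""
    return batch * steps_per_s


step = natural_step_time_s(288e9, 20e12)  # Rubin figures quoted above
print(f"{step * 1e3:.1f} ms")             # 14.4 ms
print(rack_tokens_per_s(2_000, 64))       # 128000
```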

Does sparsity hurt quality?

The roofline says sparsity is nearly free performance, so the follow-up is empirical: how much quality do you lose? From the paper “Unified Scaling Laws for Routed Language Models”, with an older MoE technique, a 64-expert model with 370M active parameters matched a dense 1.3B model. That is a 64x increase in total parameters for a 4x effective gain, a huge parameter cost for a modest efficiency win.

And yet from the systems side it is still nearly a pure win: the extra weight fetches amortize over a larger batch, so you keep increasing sparsity until you run out of simultaneous users. The real price is memory capacity, which is what the next sections are about.

Laying out a mixture of experts on a rack

An MoE layer has a router that sends each token to a small fraction of the experts (each expert being an ordinary MLP), an all-to-all “dispatch” of tokens to their experts, an all-to-all “combine” that sums the results, and a residual connection around the whole thing.

The standard practice is expert parallelism: different experts live on different GPUs. DeepSeek’s 256 experts on a Blackwell rack (using 64 of the 72 GPUs for divisibility) means 4 experts per GPU. Since the router’s decisions are data-dependent, any GPU may need to send tokens to any other GPU.
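The divisibility point is easy to see in code. This is a sketch only: the contiguous assignment of experts to GPUs is an assumption made for illustration, and real deployments may place experts differently.

```python
def place_experts(n_experts: int, n_gpus: int) -> dict[int, list[int]]:
    """Contiguous assignment of experts to GPUs; requires an even split."""
    if n_experts % n_gpus != 0:
        raise ValueError("experts must divide evenly across GPUs")
    per_gpu = n_experts // n_gpus
    return {gpu: list(range(gpu * per_gpu, (gpu + 1) * per_gpu))
            for gpu in range(n_gpus)}


placement = place_experts(256, 64)
print(placement[0])   # [0, 1, 2, 3]
print(256 % 72)       # 40, so all 72 GPUs cannot share 256 experts evenly
```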

MoE layer with expert parallelism

This all-to-all traffic pattern is a perfect fit for how a rack is wired. In NVIDIA’s design the GPUs sit on the outside of the rack and NVSwitches in the middle, with every GPU cabled to every switch, so any GPU reaches any other in two hops. This is the scale-up network (NVLink). Leaving the rack means taking the scale-out network through a NIC and a data-center switch, which is typically about 8x slower.

Scale-up vs scale-out

If you spread one expert layer across two racks, half of every all-to-all crosses the slow rack-to-rack boundary and becomes the bottleneck. So one rack bounds the size of an expert layer, and this is what has been driving interconnect domains bigger: Hopper had 8 GPUs in a scale-up domain, Blackwell 72, Rubin 500-something (some of that is Jensen math, but there is a genuine ~4x from a much harder rack design). The physical constraint is mundane: cable density. Doubling the GPUs in a rack literally doubles the density of cables that must be routed to the switches, against limits of space, weight, power, cooling, and the bend radius of the cables.

This is also a lens on model scaling history. GPT-4 (2023) was rumored to be over a trillion parameters, and models only clearly exceeded that scale once racks with tens of terabytes of fast memory arrived. Google’s TPU deployments have had very large scale-up domains for a long time, which may be part of why Gemini’s pre-training scaled successfully early. The summary: active parameters are limited by compute cost, and total parameters are limited by scale-up size.

Pipeline parallelism

Expert parallelism uses up one rack. To use more racks, the remaining options are data parallelism and pipeline parallelism (tensor parallelism has become irrelevant now that experts are small). Pipelining means putting different layers on different racks: a token flows through rack 0 for the first stage of layers, then hops to rack 1, and so on.

Is the hop a bottleneck? Compare the time spent on scale-up traffic to the time on scale-out traffic. Crossing racks sends each token once per stage, while inside the rack each token fans out to every activated expert, twice (dispatch and combine), for every layer in the stage:

\frac{t_{\text{scale-up}}}{t_{\text{scale-out}}} = \frac{1}{8} \cdot 2 \cdot (\text{activated experts}) \cdot (\text{layers per stage}) \;\geq\; 1

The 1/8 is the bandwidth ratio. With 8+ activated experts and a few layers per stage, the inequality is easily satisfied, so an entire pipeline of racks, one stage each, is communication-feasible.
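Plugging numbers into the inequality makes the margin concrete. A minimal sketch of the formula above; `bandwidth_ratio=8` is the scale-up to scale-out speed ratio quoted earlier, and the function name is illustrative.

```python
def scale_up_to_scale_out_ratio(activated_experts: int, layers_per_stage: int,
                                bandwidth_ratio: float = 8) -> float:
    """Time on in-rack all-to-all traffic relative to time on the rack-to-rack hop."""
    return (1 / bandwidth_ratio) * 2 * activated_experts * layers_per_stage


# Even with a single layer per stage, 8 activated experts clear the bar.
print(scale_up_to_scale_out_ratio(activated_experts=8, layers_per_stage=1))  # 2.0
```

A ratio of at least 1 means the in-rack traffic takes as long as or longer than the hop, so the hop is not the bottleneck.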

Dwarkesh brings up Ilya’s remark that “as we now know, pipelining is not wise,” and the architectural constraints it imposes (e.g. Kimi’s attention to layers a few back is awkward to pipeline). Pope’s framing: pipelining is a massive hassle with real but narrow benefits. It saves no runtime at all (the memory fetches just happen on a different rack), but it divides the weight storage per rack, which matters if memory capacity is your constraint.

The catch is micro-batching. To keep four racks busy, you need four micro-batches in flight, each wrapping around for its next decode step as soon as it finishes:

Pipeline schedule during decode

In inference this is natural and the bubble costs nothing; latency is identical to running unpipelined on one rack. In training there is a hard stop between the forward and backward passes of a batch, which creates a genuine bubble of idle time (the literature has zero-bubble and one-forward-one-backward schemes to interleave around it; as Dwarkesh notes, you could also mine Bitcoin in it):

Pipeline schedule during training

The training batch size itself is a trade-off: smaller batches are generally better for ML convergence (fresher gradients), larger batches are better for systems throughput, and the optimum sits in between.

The memory wall and why the KV cache won’t shard

Here Dwarkesh raises the macro puzzle. Memory is the scarce commodity of the moment: Dylan Patel claims hyperscalers are spending half of their CapEx on memory, and consumer devices are getting squeezed. Yet the pipelining analysis just said racks have a memory surplus, since a trillion-parameter model needs only ~1 TB against a rack’s tens of TB. Why is Jensen shoving all that HBM in?

Write down the memory demand across the whole system:

C_{\mathrm{mem}} = N_{\mathrm{total}} + B \cdot \mathrm{len}_{\mathrm{ctx}} \cdot \mathrm{bytes}_{\mathrm{tok}}

Sharding across $E$ GPUs of expert parallelism and $P$ racks of pipelining, the per-GPU requirement is this divided by $E \cdot P$. But the global batch is (number of micro-batches) × (micro-batch size), and the number of micro-batches needed to fill the pipeline equals $P$, while the micro-batch size $b$ is pinned near $300 \times \mathrm{sparsity}$ by the roofline. Substituting $B = P \cdot b$, the $P$’s cancel in the KV term:

c_{\mathrm{mem}}^{\text{per-GPU}} = \frac{N_{\mathrm{total}}}{E \cdot P} + \frac{b \cdot \mathrm{len}_{\mathrm{ctx}} \cdot \mathrm{bytes}_{\mathrm{tok}}}{E}

More pipeline stages keep shrinking the weight footprint, but the KV footprint per GPU stays constant: each extra stage requires proportionally more sequences in flight to stay busy. The KV cache can’t be amortized across the batch (it is unique per user), and it can’t be sharded across pipeline stages either. It loses on both fronts, and once you pipeline even a little, it becomes the dominant use of memory.
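The cancellation can be seen by evaluating the per-GPU formula for a few pipeline depths. This is a sketch only: the model size, micro-batch, context length, and expert-parallel width below are hypothetical placeholders, not figures from the talk.

```python
def per_gpu_memory(n_total_bytes, micro_batch, ctx_len, bytes_per_token,
                   expert_parallel, pipeline_stages):
    """Return (weight bytes, KV bytes) per GPU for E-way EP and P stages."""
    global_batch = pipeline_stages * micro_batch   # B = P * b fills the pipeline
    shards = expert_parallel * pipeline_stages     # E * P GPUs in total
    weights = n_total_bytes / shards
    kv = global_batch * ctx_len * bytes_per_token / shards
    return weights, kv


# Hypothetical setup: 1 TB of weights, micro-batch 256, 200k context,
# 2 kB of KV per token, 64-way expert parallelism.
for stages in (1, 2, 4, 8):
    weights, kv = per_gpu_memory(1e12, 256, 200_000, 2_000, 64, stages)
    print(f"P={stages}: weights {weights / 1e9:.2f} GB, KV {kv / 1e9:.2f} GB")
```

The weight column halves with every doubling of the stage count while the KV column does not move, which is the point of the paragraph above.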

So what do labs actually run? Per the DeepSeek paper: expert parallelism up to the scale-up domain size, then very little pipelining (maybe none, maybe 2 stages so the weights aren’t an issue). Frontier inference essentially lives inside a single scale-up domain. Each rack hop would also add on the order of milliseconds of latency per token, which stacks across stages in sequential decode.

The last piece of the scale-up story is bandwidth. The weight-fetch latency is

t_{\text{mem, weights}} = \frac{N_{\mathrm{total}}}{S \times \text{BW per GPU}}

where $S$ is the scale-up size, because all GPUs in the domain load the weights in parallel. Per-GPU HBM bandwidth improves maybe 1.5–2x per generation, but $S$ jumped 8x from Hopper to Blackwell. Pipelining solves the capacity problem; big scale-up domains solve the bandwidth problem, which is what actually lets you serve at low latency and long context.

Over-trained 100x beyond Chinchilla

Chinchilla scaling tells you the compute-optimal ratio of model size to training data. But a lab does not minimize training compute; it minimizes total compute across pre-training, RL, and inference for all its users. Pope’s heuristic: when minimizing a sum of competing costs, the minimum tends to sit where the costs are equal (true for $x + 1/x$ , for $e^x + e^{-x}$ , and generally for power laws). So set all three equal.

Using the 6ND rule (6 FLOPs per parameter per token for forward+backward, 2 for forward only):

c_{\mathrm{total}} = \underbrace{6\, N_{\mathrm{act}} D_{\mathrm{PT}}}_{\text{pre-training}} \;+\; \underbrace{[2\text{ to }6]\, N_{\mathrm{act}} D_{\mathrm{RL}} \cdot \text{inefficiency}}_{\text{RL}} \;+\; \underbrace{2\, N_{\mathrm{act}} D_{\mathrm{inf}}}_{\text{inference}}

RL sits between 2 and 6 because you generate every rollout but may not train on all of it, and it carries an extra inefficiency factor (~30%) because RL involves a lot of decode, which runs at lower MFU than training. The active parameter count divides out entirely. Working through the arithmetic on the board, the equal-cost condition lands at roughly

D_{\mathrm{PT}} \approx 1.5\, D_{\mathrm{RL}} \approx D_{\mathrm{inf}}

In words: the number of pre-training tokens, RL tokens, and lifetime inference tokens should all be about the same, within factors the analysis can’t resolve. (Dwarkesh’s gloss: every model should stream out roughly the sum of human knowledge that was streamed into it.)

Now plug in real-world guesses. Global traffic of ~500M tokens/s, cut by 5–10x for one specific model in a family, gives ~50M tokens/s; times a two-month deployment life, that is roughly $2.6 \times 10^{14}$ , call it 200T inference tokens. The rumor mill says frontier pre-training is ~150T tokens, which matches. With ~100B active parameters, Chinchilla would prescribe only ~2T tokens. The ratio is about 100x over-trained, derived almost from first principles. As Pope puts it, approximate everywhere, set A equal to B, and it’s kind of empowering how far that gets you.
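The arithmetic above, spelled out as a sketch. Two values are assumptions on my part: a two-month deployment is taken as 60 days, and the Chinchilla prescription is taken as the common rule of thumb of about 20 tokens per parameter, which is what reproduces the ~2T figure.

```python
SECONDS_PER_DAY = 86_400

tokens_per_second = 50e6           # ~50M tokens/s for one model (from the text)
deployment_days = 60               # assumption: "two months" = 60 days
inference_tokens = tokens_per_second * deployment_days * SECONDS_PER_DAY

active_params = 100e9              # ~100B active parameters (from the text)
chinchilla_tokens_per_param = 20   # assumption: common rule of thumb
chinchilla_tokens = active_params * chinchilla_tokens_per_param

pretraining_tokens = 150e12        # rumored frontier pre-training size

print(f"lifetime inference tokens: {inference_tokens:.3g}")
print(f"Chinchilla-optimal tokens: {chinchilla_tokens:.3g}")
print(f"over-training vs pre-training: {pretraining_tokens / chinchilla_tokens:.0f}x")
print(f"over-training vs inference:    {inference_tokens / chinchilla_tokens:.0f}x")
```

Depending on which token count is used, the ratio comes out somewhere between roughly 75x and 130x, which is the "about 100x" in the text.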

One asymmetry he flags: if your model might miss the frontier and get thrown away, the expected inference tokens shrink, so you should derate the inference term and err toward less over-training.

Reading the infrastructure off API prices

Since providers are incentivized to price close to cost (otherwise someone scoops them), public price sheets leak infrastructure details.

The 200k context surcharge

Gemini 3.1 charges 50% more per token beyond 200k context. Redraw the roofline as a function of context length at a fixed large batch: compute cost is flat (the attention FLOPs slope is negligible until millions of tokens), while memory cost grows linearly with the KV cache. The provider wants to be profitable at every context length, so a two-tier price is laid over a kinked cost curve, and the price bump should sit near the crossover where the model flips from compute-bound to memory-bound:

Cost vs context length with two-tier pricing

Assuming the crossover is at 200k, you can solve for the model’s KV bytes per token. Setting KV fetch time equal to compute time and cancelling the batch size:

\mathrm{bytes}_{\mathrm{tok}} = \frac{\text{mem BW}}{\mathrm{FLOPs}} \cdot \frac{N_{\mathrm{act}}}{\mathrm{len}_{\mathrm{ctx}}} = \frac{1}{300} \cdot \frac{100\mathrm{B}}{200\mathrm{k}} \approx 1.7\ \text{kB per token}

Is ~2 kB/token plausible? The KV size is (number of unique attention contexts) × 2 × $d_{\mathrm{head}}$ × (KV heads). With $d_{\mathrm{head}} = 128$, 8 KV heads, and a single global context shared across all layers (the Character AI trick, also used in Gemma), that is 1 × 2 × 128 × 8 = 2,048 values, or about 2 kB at one byte per value. Sparse attention gets there with bigger raw numbers divided by the sparsity. So the pricing page is consistent with a real architecture, if maybe a little on the small side.
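Both sides of that comparison fit in a few lines. This is a sketch with illustrative function names; the one-byte-per-value default is an assumption needed to turn the element count into bytes.

```python
def kv_bytes_from_price_kink(active_params: float,
                             crossover_ctx: float,
                             flops_per_byte: float = 300.0) -> float:
    """KV bytes per token implied by a compute/memory crossover at crossover_ctx."""
    return active_params / (flops_per_byte * crossover_ctx)


def kv_bytes_from_architecture(unique_contexts: int, d_head: int,
                               kv_heads: int, bytes_per_value: int = 1) -> int:
    """KV bytes per token: contexts x 2 (keys and values) x d_head x KV heads."""
    return unique_contexts * 2 * d_head * kv_heads * bytes_per_value


implied = kv_bytes_from_price_kink(active_params=100e9, crossover_ctx=200e3)
candidate = kv_bytes_from_architecture(unique_contexts=1, d_head=128, kv_heads=8)
print(f"implied by pricing: {implied:.0f} bytes, candidate architecture: {candidate} bytes")
```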

Input vs output prices

Output tokens cost 3–5x more than input tokens. The two phases differ in tokens per forward pass: decode processes one new token per pass, prefill processes the whole prompt in one pass. Dividing the same roofline by the tokens per pass (len_pass) gives the cost per token: the compute term is flat, and the memory term is a hyperbola that only bites when len_pass is small.

Prefill vs decode cost per token

Prefill is compute-bound, decode is memory-bandwidth-bound, and a 5x price gap says decode at the provider’s operating point is deeply memory-bound: they are paying ~5x more per output token in memory time than the compute floor.

This also explains the context length plateau. Contexts jumped from ~8k (GPT-3 era) to 100–200k around GPT-4 and have hovered there for a year or two, which suggests that is the balanced cost point. The barrier to 100M-token contexts (the “in-context learning is enough for AGI” scenario) is memory bandwidth and capacity, and HBM is not getting hugely better. Sparse attention (DeepSeek published one mechanism that effectively puts a square root on the KV term) is a big one-time improvement, but going too sparse costs quality, so it is a get-out, not a solution.

Cache pricing and memory tiers

Providers charge much less for cached input tokens, and charge different rates for keeping a cache alive 5 minutes vs 1 hour. There are two ways to produce a KV cache for a token: rematerialize it from scratch (a forward pass: pure compute cost, nothing to store) or hold it in some memory tier (near-zero retrieval cost, but you occupy capacity that scales with hold time). Each tier down (HBM → host DDR → flash → spinning disk) is cheaper to occupy and slower to retrieve from.

Which tier backs which price? Pope’s rule: a storage tier is well-matched to hold times around its drain time, capacity divided by bandwidth, the same ratio that gave 20 ms for HBM. DDR drains in seconds, flash in about a minute, spinning disk in about an hour. So a 5-minute cache tier and a 1-hour cache tier probably map to flash and spinning disk, which surprised him: “I’m kind of shocked to see spinning disk being used at all.”
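A rough sketch of the matching rule. The drain times for HBM, flash, and disk follow the text; the 5-second figure for DDR is an assumed stand-in for "seconds", and picking the nearest tier on a log scale is my own simplification of "well-matched".

```python
import math

# Approximate drain times (capacity / bandwidth) per tier, in seconds.
DRAIN_TIME_S = {
    "HBM": 0.02,
    "host DDR": 5.0,          # assumption: the text only says "seconds"
    "flash": 60.0,
    "spinning disk": 3600.0,
}


def drain_time_s(capacity_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to read the whole tier once at full bandwidth."""
    return capacity_bytes / bandwidth_bytes_per_s


def best_matched_tier(hold_time_s: float) -> str:
    """Tier whose drain time is closest to the hold time on a log scale."""
    return min(
        DRAIN_TIME_S,
        key=lambda tier: abs(math.log(hold_time_s / DRAIN_TIME_S[tier])),
    )


print(best_matched_tier(5 * 60))    # flash
print(best_matched_tier(60 * 60))   # spinning disk
```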

Convergent evolution with cryptography

The sit-down portion covers Pope’s blog post on how neural nets and ciphers evolved similar shapes. Both need to thoroughly mix information across all their inputs, and even stirring cake batter alternates directions for the same reason. But they optimize in opposite directions. A neural net is kept differentiable in a useful way: residual connections and LayerNorm exist to keep the derivative simple and meaningful for gradient descent. A cipher is designed so that its derivative is useless: differential cryptanalysis attacks a cipher by differentiating it (over the field of two elements), and a well-designed cipher makes a small input difference blow up into a huge output difference, the avalanche effect. Adversarial examples in image models are exactly the avalanche property showing up where it is not wanted.

Building ciphers out of neural nets is a bad idea (99% of new ciphers get broken), but one construction has productively flowed the other way. A Feistel network turns any non-invertible function $f$ into an invertible two-input block:

g(x, y) = (\,y + f(x),\; x\,)

To invert an output $(z, x)$, where $z = y + f(x)$ is the first slot, read off $x$ from the second slot, then recover $y = z - f(x)$.

Feistel construction

The 2017 RevNets paper imported this into deep learning: make each layer a Feistel block (which turns out to look like a residual connection from two layers back) and the whole network becomes invertible. Training normally has to write every layer’s activations to HBM on the forward pass so that the backward pass can read them, a memory footprint linear in depth and often the largest one in training. An invertible network stores none of it: during the backward pass it undoes the forward pass in lockstep, rematerializing activations as needed.
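A toy sketch of the construction on plain integers, to show that a stack of Feistel blocks can be undone layer by layer even when every $f$ is non-invertible. The helper names and the particular choices of $f$ are illustrative and not taken from the RevNets paper.

```python
def feistel_forward(x, y, f):
    """One block: g(x, y) = (y + f(x), x)."""
    return y + f(x), x


def feistel_inverse(z, x, f):
    """Undo one block: the second slot is x itself, and y = z - f(x)."""
    return x, z - f(x)


def run_forward(x, y, layers):
    for f in layers:
        x, y = feistel_forward(x, y, f)
    return x, y


def run_inverse(x, y, layers):
    # Walk the layers backwards, recomputing each input from its output.
    for f in reversed(layers):
        x, y = feistel_inverse(x, y, f)
    return x, y


# None of these functions is invertible on its own (squaring and abs
# both collapse v and -v).
layers = [lambda v: v * v, lambda v: abs(v) + 1, abs]

out = run_forward(3, 10, layers)
print(out)
print(run_inverse(*out, layers))  # (3, 10)
```

The inverse pass never needs stored intermediate values, only the final output and the same $f$ functions, which is what lets a reversible network skip saving activations.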

That is spending compute to save memory, the exact mirror image of the KV cache, which spends memory to save compute. Given where hardware is, the KV cache direction is usually the profitable one, which is a fitting last word for a lecture that is mostly about the price of memory.

How this post was made, in the interest of transparency. I first had one agent (Codex) prepare the raw material. My prompt, verbatim:

https://www.youtube.com/watch?v=xmkSf5IS-zw

download using yt dlp. create a jpeg screenshot every 30 seconds and save it

also get the transcription for it from dwarkesh’s website if it exists. if not, you can transcribe using the whisper model in bob@isengard

save it in a folder. I want to prepare it for an another agent to prepare it for further processing

That produced the video, 268 frames at 30-second intervals, and Dwarkesh’s published transcript (no Whisper needed). I then handed the folder to a second agent (Fable, in Cursor), which read the transcript, inspected the frames, and redrew the blackboard diagrams as matplotlib figures. My prompt, verbatim:

convert this to a write-up that is faithful to the original. use the video and images if needed. create figures based on what is on the screen and such

The draft first lived as a standalone GitHub document (“just make it standalone, dont add it to my blog”, then “create doc in ~/scratch repo, github markdown. i will read that. make sure it renders nicely”) before I asked for this post (“ok I want you to create a blog post citing the dwarkesh podcast in my blog, linking the youtube and just saying that this is a write-up of the video”). In between I reviewed the output and asked for fixes, for example on the cost figure (“the graph after this: the lines are a bit tight. could we make it clearer?”) and on figure rendering (“also, in the svgs, the space between some things are too much. if you can improve the text rendering in the svgs, it would be great”).

Onur Solmaz · Post · /2026/07/14

Theoretical Upper Bounds for LLM Performance

Note: this post is AI generated, adapted from my working notes behind the Local Frontier calculator. It is a work in progress and may lack rigor in places. If you find an issue in the formulation, please email me at [email protected].

How fast can a machine serve a large language model? I built Local Frontier, an interactive database of local AI hardware and model profiles, to answer that question for any hardware and model pair. This post derives, from first principles, the math the calculator runs. That math is a roofline, an upper bound that says what a given hardware and model pair cannot exceed under a stated set of assumptions. Nothing in it is specific to local machines. The bounds are built from memory capacity and bandwidth alone, so the same formulas cover a MacBook and a datacenter GPU node. The post focuses on local hardware because that is what Local Frontier compares. Its outputs are ceilings for comparing machines, and a real implementation lands somewhere below them.

The argument builds in layers, each created by a problem the previous layer cannot solve. A memory system is two numbers, capacity and bandwidth, and their product turns out to be a natural measure of what the system can fund. That product yields a clean but loose throughput ceiling for batched decoding. The loose ceiling ignores per-session context traffic, so it is too generous for real serving, and repairing it gives the bound the calculator actually uses. The repaired bound is then filled in per architecture by small adapters and checked against real hardware.

Interactive figures accompany the derivation. They all use the same toy setup, a 32B-class dense model at 4-bit (18 GB of weights, 6 GB of runtime overhead, 0.26 GB of KV per thousand tokens of context) on a 128 GB machine with 800 GB/s of sustained bandwidth, so the numbers stay comparable from figure to figure.

Three questions

Serving an LLM raises three separate questions, and mixing them up is an easy way to be wrong about a machine.

| Question | What it asks |
| --- | --- |
| Resident fit | Can the model plus runtime overhead be held in memory at all? |
| Single-session speed | What is the memory-side ceiling for one active conversation? |
| Useful serving throughput | Across many active sessions, how many tokens per second can the device produce while each session stays above a minimum useful rate? |

The third question is the hard one. A machine can fit many sessions in memory and still be too slow per session at that concurrency. So the calculator must report both whether sessions fit and whether the fitting sessions are fast enough to be worth running.

Every number produced here is an upper bound. A real system can fall below it for reasons the memory model deliberately ignores, among them compute limits, kernel quality, quantization overhead, scheduling, CPU involvement, paging, interconnects, and thermal throttling. The value of a clean upper bound is that it tells you the best case you are allowed to hope for, and therefore how much room an implementation still has.

Memory power

Before any model-specific detail, ask what a memory system fundamentally offers. When people compare accelerators for local inference they list many specs, from capacity and bandwidth through compute throughput, cache hierarchy, PCIe lanes, and thermals. For the decode phase of autoregressive generation, two of these recur in almost every bound. How much state can the memory hold, and how fast can it move that state? We start there.

Two numbers and their product

Model an idealized memory system by exactly two quantities.

C = \text{usable memory capacity}, \qquad R = \text{sustained memory bandwidth}.

Capacity is a stock, the amount of resident state that can exist at once. Bandwidth is a flow, the amount of state that can be moved per second. They have different units and answer different questions, so any single product of them needs justifying before we rely on it.

Define memory power as

D = C R .

If $C$ is in GB and $R$ is in GB/s, then $D$ is in $\mathrm{GB}^2/\mathrm{s}$ . Power is meant in the colloquial sense of capability, as in computing power, and no watts appear anywhere in this model. The units look strange, so the rest of this section explains why this particular combination of $C$ and $R$ is the right scalar.

For a feel of the scale, a 24 GB GPU at 1000 GB/s has $D = 24 \times 1000 = 24{,}000\ \mathrm{GB}^2/\mathrm{s}$ . One thing to be careful with throughout is that every memory quantity in a given calculation must use the same unit system. The equations are identical in bytes, GB, or bits, and only the numeric value of $D$ changes.

Feasibility theorem

Here is the toy problem that justifies $CR$ . The models in this formulation are deliberately simple, and they are meant as rules of thumb, useful when deciding on hardware for a model or when checking whether an existing setup is getting the most out of the hardware it runs on.

A pure memory workload is a pair $w = (h, r)$ , where $h$ is the resident information it must keep alive and $r$ is the information flow rate it must sustain. The workload is feasible on the system $M(C, R)$ when the memory can both hold the state and carry the traffic. There are exactly two ways to fail.

The first is a capacity failure. If a workload needs more resident state than the memory can store, it cannot fit, and no scheduling trick repairs that.

h \le C .

The second is a flux failure. If a workload needs more traffic per second than the interface can deliver, it cannot be sustained, and no surplus capacity repairs that.

r \le R .

In the idealized model these two conditions are also sufficient. If $h \le C$ and $r \le R$ , allocate $h$ of state and stream at rate $r$ . Therefore the feasible set is precisely the rectangle

F(C, R) = \{(h, r) : 0 \le h \le C,\ 0 \le r \le R\},

whose area is

\mu(F) = C R .

So memory power has a concrete meaning. It is the measure of the feasible workload region. A workload lives at a point $(h, r)$ in the plane, the system can serve every workload inside its rectangle and none outside, and the size of that rectangle is $CR$ .
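As a sketch, the rectangle model is a two-field record with one product and one test. The class and method names are illustrative; the example reuses the 24 GB, 1000 GB/s GPU from above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemorySystem:
    capacity_gb: float       # C, usable memory capacity
    bandwidth_gb_s: float    # R, sustained memory bandwidth

    @property
    def memory_power(self) -> float:
        """D = C * R, in GB^2/s: the area of the feasible rectangle."""
        return self.capacity_gb * self.bandwidth_gb_s

    def is_feasible(self, resident_gb: float, flow_gb_s: float) -> bool:
        """True when the workload (h, r) lies inside the rectangle."""
        return (0 <= resident_gb <= self.capacity_gb
                and 0 <= flow_gb_s <= self.bandwidth_gb_s)


gpu = MemorySystem(capacity_gb=24, bandwidth_gb_s=1000)
print(gpu.memory_power)            # 24000
print(gpu.is_feasible(20, 900))    # True: fits and can be streamed
print(gpu.is_feasible(30, 100))    # False: capacity failure
print(gpu.is_feasible(10, 1500))   # False: flux failure
```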

A caution before building on this. The rectangle is a toy model, and the world is messier in both directions. Real hardware delivers only a fraction of its catalog bandwidth. Even a perfectly sequential read falls short of the spec number, and the achieved fraction depends on the access pattern, on whether enough parallel work is in flight to hide memory latency, and on the hardware itself. The sufficiency claim is idealized too, because the rate a machine reaches depends on which bytes are read and in what order. The bounds below survive this, since real inefficiency only pushes a machine further below its ceiling. The cost lands on comparisons between machines. If one machine sustains 85% of its spec bandwidth and another 60%, ceilings computed from spec numbers make the second machine look better than it really is. Read $R$ as sustained bandwidth where a measured number exists, and read every comparison in this post as approximate.

A caveat on the metric

One misreading is worth heading off. The product $CR$ measures the workload-feasibility region, the set of jobs the device can support at an instant. A different question, “how many distinct memory histories can this device produce over a time $T$ ”, has a different answer, namely the resident information plus the information streamed over that window,

C + R T .

The feasibility-region reading is the one an inference-throughput bound will need, because serving means keeping sessions resident in memory and streaming data for them.

A first decode bound

The feasibility theorem describes static workloads. Decoding is dynamic, emitting tokens over time. This section turns memory power into a throughput ceiling for batched decoding, and in doing so shows where the product $CR$ enters a real performance bound.

A toy decoder

Consider the decode phase of an autoregressive transformer, simplified to the memory system alone. Let

W = \text{model weight footprint}, \qquad K = \text{per-session KV/state footprint},

let $b$ be the batch size (number of concurrent sequences), and let $s$ be the number of decode steps per second. Each decode step emits one token per active sequence, so aggregate output throughput is

T = b \, s .

Throughput is a product of two things, how many sequences run in parallel and how fast the shared model can be swept, and the memory system caps each factor separately.

Capacity limit

The model weights must be resident once, and every active sequence needs its own KV cache, the per-conversation attention state that grows with context length. So the resident state is $W + bK$ , and it must fit.

W + bK \le C \quad\Longrightarrow\quad b \le \frac{C - W}{K} .

This is the capacity limit, and it sets the maximum parallelism. It is exactly the $h \le C$ condition of the feasibility theorem, with $h = W + bK$ .

Bandwidth limit

In a dense transformer, each decode step must apply the model weights. In the memory-bound idealization, applying the weights means streaming roughly $W$ of data per step. With bandwidth $R$ , the step rate obeys

s W \le R \quad\Longrightarrow\quad s \le \frac{R}{W} .

This is the bandwidth limit, and it sets the maximum step rate. It is the $r \le R$ condition, with the per-step traffic playing the role of the flux.

Memory-power decode bound

Multiply the two caps. Throughput is parallelism times step rate, and each is separately bounded, so

T = b\, s \le \frac{C - W}{K}\cdot\frac{R}{W} = \frac{R\,(C - W)}{K W} .

Substituting $D = CR$ and factoring out $C$ exposes the memory-power term.

T \le \frac{D}{K W}\left(1 - \frac{W}{C}\right).

When the model is much smaller than memory, $W \ll C$ , the correction vanishes and the bound collapses to the memorable form

T \lesssim \frac{D}{K W} .

This is the memory-power decode bound. The product $CR$ appears because maximum throughput genuinely factors into maximum parallelism times maximum step rate.

\underbrace{\frac{C}{K}}_{\text{how many sessions resident}}\times \underbrace{\frac{R}{W}}_{\text{how many sweeps per second}} = \frac{CR}{KW} = \frac{D}{KW} .

The numerator $D = CR$ is the machine. The denominator $KW$ is the workload, the per-session state times the model sweep size. In words, throughput is at most memory power divided by the memory cost per active model-token.
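Here is the bound evaluated on the toy setup from the introduction (128 GB, 800 GB/s, 18 GB of weights, 0.26 GB of KV per thousand tokens). The 8k-token context per session is an assumption made for this sketch, and the toy decoder ignores runtime overhead just as the derivation does.

```python
def decode_bound(C, R, W, K):
    """T <= ((C - W) / K) * (R / W): max parallelism times max step rate."""
    max_batch = (C - W) / K       # capacity limit
    max_step_rate = R / W         # bandwidth limit
    return max_batch * max_step_rate


def decode_bound_memory_power(C, R, W, K):
    """The same bound written as D / (K W) * (1 - W / C)."""
    D = C * R
    return D / (K * W) * (1 - W / C)


C, R, W = 128.0, 800.0, 18.0      # GB, GB/s, GB
K = 0.26 * 8                      # GB per session at an assumed 8k context

print(f"{decode_bound(C, R, W, K):.0f} tokens/s")
print(f"{decode_bound_memory_power(C, R, W, K):.0f} tokens/s")
```

Both forms give the same ceiling, a little over 2,300 tokens/s under these assumptions.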

Scope of the bound

The memory-power decode bound governs batched throughput. Set $b = 1$ and it degenerates to

T \lesssim \frac{R}{W},

so for a single session only bandwidth matters and capacity is merely a fit constraint. That is the correct behavior, and it makes clear what each metric is for.

| Use case | The metric that governs it |
| --- | --- |
| Single-user local chat | bandwidth $R$, with capacity $C$ as a fit gate |
| Largest model that fits | capacity $C$ first, then bandwidth |
| Maximum batched decode throughput | memory power $D = CR$ |

Memory power is the right scalar precisely for the third row. The next section explains why even that bound is too optimistic.

Missing traffic

The memory-power decode bound assumes the only per-token memory traffic worth counting is the model sweep $W$ , shared across the batch. Real decoding also reads each session’s growing KV cache, and that traffic is private, so it does not amortize over the batch. Ignoring it makes the bound promise throughput that long-context serving can never reach. This section introduces the correct per-token accounting and shows the memory-power bound falls out of it as a loose corollary.

Universal resource bound

Step back to the most general statement, which holds regardless of architecture. If each output token costs at least $q_{\min}$ of unavoidable memory traffic and at least $a_{\min}$ of unavoidable compute, and the device delivers at most $R$ bytes/s and $F$ FLOP/s, then over all setups that fit,

T_{\max} \le \max_{\text{setup fits}} \min\!\left(\frac{R}{q_{\min}},\ \frac{F}{a_{\min}}\right).

Two rooflines, and the workload lives under the lower of them. Throughout this post I assume decode is memory-bound, which is the usual case for local serving, and keep the memory roofline,

T_{\max} \le \frac{R}{q},

where $q$ is the memory traffic per output token. Everything now reduces to estimating $q$ honestly. The focus on the memory side is also practical. A hardware catalog can collect capacity and bandwidth consistently across consumer and workstation devices, while comparable sustained-compute numbers are much harder to obtain.

Bytes per token

Split the per-token traffic into the two kinds that behave differently under batching. The model (or active-expert) weights are shared, since one sweep serves the whole batch, so their per-token cost is divided by the batch. The KV/context read is private, since each session reads its own cache, so its per-token cost is not divided at all. With $W_{\mathrm{active}}$ the shared weight traffic per iteration, $\rho$ the tokens emitted per session per iteration (one for ordinary decoding), and $K_{\mathrm{read}}(L)$ the private context traffic per output token at active context length $L$ ,

q_{\text{simple}}(b) = \frac{W_{\mathrm{active}}}{b\,\rho} + K_{\mathrm{read}}(L).

The first term shrinks as the batch grows, because more sessions share each weight sweep. The second term does not move. A larger batch does not make any session’s context cheaper to read, and this asymmetry is the main reason long-context serving behaves differently from short-context serving.
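A short sketch of the asymmetry, using the toy dense model (18 GB of weights). It assumes an 8k read context and that each output token reads the whole stored KV once, so the private read is 0.26 × 8 GB.

```python
def q_simple(b, w_active_gb, k_read_gb, rho=1.0):
    """GB of memory traffic per output token at batch size b."""
    shared = w_active_gb / (b * rho)   # weight sweep, amortized over the batch
    private = k_read_gb                # per-session context read, not amortized
    return shared + private


W_ACTIVE = 18.0        # GB, dense model so every weight is active
K_READ = 0.26 * 8      # GB, assumed 8k read context

for b in (1, 8, 64, 512):
    print(b, round(q_simple(b, W_ACTIVE, K_READ), 3))
```

However large the batch gets, the per-token traffic never drops below the private term, here 2.08 GB.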

Interactive figure: bytes per output token as the batch and the read context change.

Memory-power bound as a corollary

Treat $q_{\text{simple}}$ as an optimistic lower estimate of the true bytes per token, $q_{\text{simple}}(b) \le q_{\text{actual}}(b)$ . Dividing $R$ by a smaller denominator gives a larger quotient, so substituting $q_{\text{simple}}$ keeps the result an upper bound.

T_{\max} \le \max_{b\,:\,W_{\mathrm{resident}} + b K_{\mathrm{store}} + O \le C}\ \frac{R}{\dfrac{W_{\mathrm{active}}}{b\rho} + K_{\mathrm{read}}(L)} .

Here the maximization runs over batches that fit in memory, with $W_{\mathrm{resident}}$ the resident model footprint, $K_{\mathrm{store}}$ the KV memory stored per session, and $O$ the runtime overhead. This is the honest simple bound, bandwidth divided by shared-per-token cost plus private-per-token cost, maximized over fitting batches.

Now recover the previous section’s bound by deliberately throwing information away. Since $K_{\mathrm{read}}(L) \ge 0$, dropping it only shrinks the denominator, which loosens the bound.

\frac{R}{\dfrac{W_{\mathrm{active}}}{b\rho} + K_{\mathrm{read}}(L)} \le \frac{R}{\dfrac{W_{\mathrm{active}}}{b\rho}} = \frac{b\,\rho\,R}{W_{\mathrm{active}}} .

The capacity constraint caps the batch at $b \le (C - W_{\mathrm{resident}} - O)/K_{\mathrm{store}}$ , so

T_{\max} \le \rho\,\frac{R\,(C - W_{\mathrm{resident}} - O)}{K_{\mathrm{store}}\,W_{\mathrm{active}}} = \rho\,\frac{D}{K_{\mathrm{store}}\,W_{\mathrm{active}}}\left(1 - \frac{W_{\mathrm{resident}} + O}{C}\right).

This is the memory-power decode bound again. It is what you get by discarding the private context term and using the largest batch that fits. That is why it can sit far above achievable throughput while remaining a true ceiling. The ordering is

\text{actual throughput} \;\le\; \text{simple (KV-aware) bound} \;\le\; \text{memory-power bound}.

The memory-power bound is the orientation line. The KV-aware bound is the one to serve from, and the next section develops it into the operational calculator.
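To see how far apart the two ceilings can sit, here is a sketch on the toy setup. It assumes a single 12k context used for both storage and reads, and a dense model so that $W_{\mathrm{active}} = W_{\mathrm{resident}}$.

```python
import math


def simple_kv_bound(C, R, w_resident, w_active, overhead, k_store, k_read, rho=1.0):
    """Max over fitting batches of R / (W_active / (b rho) + K_read).

    The traffic per token falls as b grows, so the maximum is at the
    largest batch that fits in memory.
    """
    b_max = math.floor((C - w_resident - overhead) / k_store)
    if b_max < 1:
        return 0, 0.0
    return b_max, R / (w_active / (b_max * rho) + k_read)


def memory_power_bound(C, R, w_resident, w_active, overhead, k_store, rho=1.0):
    """rho * R * (C - W_resident - O) / (K_store * W_active)."""
    return rho * R * (C - w_resident - overhead) / (k_store * w_active)


K = 0.26 * 12   # GB per session at an assumed 12k context
b_max, kv_aware = simple_kv_bound(128, 800, 18, 18, 6, K, K)
loose = memory_power_bound(128, 800, 18, 18, 6, K)
print(b_max, round(kv_aware), round(loose))
```

Under these assumptions the KV-aware ceiling is a bit over 200 tokens/s while the memory-power ceiling is close to 1,500, so the ordering holds with a wide gap.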

KV-aware bound

This section turns the simple bytes-per-token bound into the model the calculator runs. Three refinements are needed. Context must be split into the part that controls memory and the part that controls speed. The shared weight traffic must be allowed to grow with the batch, which matters for mixture-of-experts (MoE) models, models that route each token through a small subset of their weights. And the batch must be filtered so that we never count concurrency at which every session has become uselessly slow.

Two context lengths

A serving system usually reserves KV space for a long maximum context but reads, on average, a shorter active context. These two lengths drive different parts of the bound, so we keep them separate.

\begin{aligned} L_{\mathrm{alloc}} &= \text{reserved (maximum) context}, \\ L_{\mathrm{read}} &= \text{average active context}, \\ L_{\mathrm{read}} &\le L_{\mathrm{alloc}}. \end{aligned}

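The allocation length controls how much KV memory each session reserves, and therefore how many sessions fit. The read length controls how much context each output token must stream, and therefore per-token cost. Collapsing them into one number misprices one side: using $L_{\mathrm{alloc}}$ for both overcharges the per-token read traffic, and using $L_{\mathrm{read}}$ for both under-reserves memory and overstates how many sessions fit.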

Model quantities

A model contributes five quantities. Two are about fitting, two are about speed, and one is about decoding style.

| Symbol | Role |
| --- | --- |
| $W_{\mathrm{resident}}$ | Full resident footprint (for MoE, all resident weights, including the inactive experts) |
| $W_{\mathrm{batch}}(b)$ | Shared weight traffic per decode iteration at batch $b$ |
| $K_{\mathrm{alloc}}(L_{\mathrm{alloc}})$ | KV/cache memory reserved per session, which controls concurrency |
| $K_{\mathrm{read}}(L_{\mathrm{read}})$ | Private context traffic per output token, which controls decode cost |
| $\rho$ | Tokens emitted per session per iteration ($\rho = 1$ ordinary, $\rho > 1$ speculative) |

The split between $K_{\mathrm{alloc}}$ and $K_{\mathrm{read}}$ mirrors the split between the two context lengths. Allocation controls how many sessions fit, read controls how fast each one decodes. Note that $W_{\mathrm{batch}}$ now depends on the batch size $b$ , for a reason the adapter section explains.

Memory-fit batch

The first gate is whether sessions fit. Load the model, reserve overhead, and divide the remainder by the per-session allocation.

b_{\mathrm{mem}}(L_{\mathrm{alloc}}) = \left\lfloor \frac{C - W_{\mathrm{resident}} - O}{K_{\mathrm{alloc}}(L_{\mathrm{alloc}})} \right\rfloor .

This is memory-fit concurrency only. It is necessary but not sufficient, and it is exactly the trap that makes a machine look like it can serve a hundred sessions when it cannot serve them usefully.
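A minimal sketch of the memory gate on the toy setup (128 GB, 18 GB of weights, 6 GB of overhead, 0.26 GB of KV per thousand tokens). The function name and the reserved context lengths are illustrative.

```python
import math


def memory_fit_batch(C, w_resident, overhead, k_alloc):
    """Number of sessions whose reserved KV fits beside the model."""
    free = C - w_resident - overhead
    if free <= 0:
        return 0          # the model itself does not fit
    return math.floor(free / k_alloc)


KV_GB_PER_1K_TOKENS = 0.26

for alloc_k_tokens in (32, 48, 128):
    k_alloc = KV_GB_PER_1K_TOKENS * alloc_k_tokens
    sessions = memory_fit_batch(C=128, w_resident=18, overhead=6, k_alloc=k_alloc)
    print(f"{alloc_k_tokens}k reserved context: {sessions} sessions fit")
```

This says nothing yet about whether those sessions decode at a useful rate.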

Interactive figure: how many sessions fit in memory as capacity and reserved context change.

Aggregate and per-session ceilings

The per-token traffic is the shared weight sweep amortized over the emitted tokens, plus the private context read.

q_{\mathrm{KV}}(b, L_{\mathrm{read}}) = \frac{W_{\mathrm{batch}}(b)}{b\,\rho} + K_{\mathrm{read}}(L_{\mathrm{read}}) .

Then the memory roofline $T \le R/q$ gives the aggregate ceiling at batch $b$ ,

T(b, L_{\mathrm{read}}) \le \frac{R}{q_{\mathrm{KV}}(b, L_{\mathrm{read}})},

and dividing by the batch gives the per-session rate,

r(b, L_{\mathrm{read}}) = \frac{T(b, L_{\mathrm{read}})}{b} = \frac{\rho R}{W_{\mathrm{batch}}(b) + b\,\rho\,K_{\mathrm{read}}(L_{\mathrm{read}})} .

As $b$ grows, the aggregate $T$ rises but the per-session $r$ falls. That tension is the whole serving tradeoff, and it is why a fit-only bound is not enough.
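
The tension is easy to see numerically. The sketch below evaluates both ceilings at a few batch sizes, holding $W_{\mathrm{batch}}$ constant as a dense model would. The bandwidth and traffic values are hypothetical, chosen only to make the trend visible.

```python
def aggregate_ceiling(R, w_batch, k_read, b, rho=1.0):
    """T(b) <= R / q, with q = W_batch / (b * rho) + K_read."""
    q = w_batch / (b * rho) + k_read
    return R / q

def per_session_ceiling(R, w_batch, k_read, b, rho=1.0):
    """r(b) = T(b) / b, which equals rho * R / (W_batch + b * rho * K_read)."""
    return aggregate_ceiling(R, w_batch, k_read, b, rho) / b

# Hypothetical values: R in GB/s, W_batch and K_read in GB.
R, W, K = 273.0, 2.0, 0.5
for b in (1, 4, 16, 64):
    print(b, round(aggregate_ceiling(R, W, K, b), 1), round(per_session_ceiling(R, W, K, b), 1))
```

The first column of results climbs with the batch while the second shrinks.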

Usable-batch correction

The fix is to refuse batches at which a session would crawl. Impose a per-session floor $r_\star$ , the minimum useful tokens/s/session, and solve $r(b) \ge r_\star$ for $b$ . Replacing the batch-dependent $W_{\mathrm{batch}}(b)$ by its shared lower bound $W_{\mathrm{active}}$ keeps a closed form. Because $W_{\mathrm{batch}}(b) \ge W_{\mathrm{active}}$ , the substitution only weakens the condition, so the implication runs one way, and the closed form is a necessary condition on the admissible batch rather than a sufficient one.

r(b) \ge r_\star \quad\Longrightarrow\quad b \le \frac{\rho R / r_\star - W_{\mathrm{active}}}{\rho\,K_{\mathrm{read}}(L_{\mathrm{read}})} ,

which defines a rate-limited batch

b_{\mathrm{rate}}(L_{\mathrm{read}}, r_\star) = \left\lfloor \frac{\rho R / r_\star - W_{\mathrm{active}}}{\rho\,K_{\mathrm{read}}(L_{\mathrm{read}})} \right\rfloor .

The usable batch is whichever gate binds first.

b_{\mathrm{usable}} = \min\!\big(b_{\mathrm{mem}}(L_{\mathrm{alloc}}),\ b_{\mathrm{rate}}(L_{\mathrm{read}}, r_\star)\big).

Because $b_{\mathrm{rate}}$ comes from a necessary condition, $b_{\mathrm{usable}}$ is itself an upper bound on the truly admissible batch, and the calculator applies the exact floor test with the true $W_{\mathrm{batch}}(b)$ in the next step. This is what stops the “hundred sessions” illusion. As context grows, $K_{\mathrm{read}}(L)$ grows, so $b_{\mathrm{rate}}$ falls quickly even while $b_{\mathrm{mem}}$ stays large. The KV slots fit while the useful rate does not.
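
A small sketch of the two gates, under assumed inputs. The clamp at zero is my addition for the case where even one session misses the floor, and the sample numbers (a 2 GB active sweep, 0.5 GB of context read per token) are hypothetical.

```python
import math

def rate_limited_batch(R, r_star, w_active, k_read, rho=1.0):
    """b_rate = floor((rho * R / r_star - W_active) / (rho * K_read)), clamped at zero."""
    return max(0, math.floor((rho * R / r_star - w_active) / (rho * k_read)))

def usable_batch(b_mem, b_rate):
    """Whichever gate binds first."""
    return min(b_mem, b_rate)

# Hypothetical: R = 273 GB/s, floor of 20 tok/s/session, and a memory gate of 100 sessions.
b_rate = rate_limited_batch(R=273.0, r_star=20.0, w_active=2.0, k_read=0.5)
print(b_rate, usable_batch(100, b_rate))
```

With these inputs the memory gate admits far more sessions than the rate gate does, which is the situation the paragraph above describes.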

Interactive figure: aggregate and per-session ceilings against batch size, with the memory gate and the per-session floor.

The bound the calculator uses

Collecting the pieces, define the usable batch set as the fitting batches that also clear the floor,

\mathcal{B}(L_{\mathrm{alloc}}, L_{\mathrm{read}}, r_\star) = \left\{ b : 1 \le b \le b_{\mathrm{mem}}(L_{\mathrm{alloc}}),\ \frac{1}{b}\,\frac{R}{q_{\mathrm{KV}}(b, L_{\mathrm{read}})} \ge r_\star \right\},

and take the best aggregate over that set.

T_{\max}(L_{\mathrm{alloc}}, L_{\mathrm{read}}, r_\star) \le \max_{b \in \mathcal{B}}\ \frac{R}{\dfrac{W_{\mathrm{batch}}(b)}{b\,\rho} + K_{\mathrm{read}}(L_{\mathrm{read}})} .

This is the KV-aware bound, the main practical formulation. In words, try every batch that fits, reject the ones too slow per session, and for the rest take bandwidth divided by bytes per output token, keeping the best.
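
The description in words translates almost line for line into a loop. This is a sketch, not the calculator's actual code. `w_batch` is passed as a function of the batch so an MoE adapter can plug in, and the constant used in the example is a hypothetical dense-style stand-in.

```python
def kv_aware_bound(R, b_mem, w_batch, k_read, r_star, rho=1.0):
    """Return (best aggregate ceiling, its batch), or None if no fitting batch clears the floor."""
    best = None
    for b in range(1, b_mem + 1):            # try every batch that fits
        q = w_batch(b) / (b * rho) + k_read  # bytes per output token
        aggregate = R / q
        if aggregate / b < r_star:           # too slow per session: reject
            continue
        if best is None or aggregate > best[0]:
            best = (aggregate, b)
    return best

# Hypothetical inputs: 273 GB/s, 100 fitting sessions, constant 2 GB sweep, 0.5 GB read per token.
print(kv_aware_bound(273.0, 100, lambda b: 2.0, 0.5, 20.0))
```

Returning `None` rather than zero keeps the empty-set case distinct from a genuine fit failure.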

The looser memory-power bound is its corollary, obtained as before by dropping the private term and using the largest fitting batch.

T_{\max} \le \rho\,\frac{D}{K_{\mathrm{alloc}}(L_{\mathrm{alloc}})\,W_{\mathrm{active}}}\left(1 - \frac{W_{\mathrm{resident}} + O}{C}\right), \qquad D = CR.

The two stand in a fixed relation, which is the main result of the derivation, written first by name and then in full.

\begin{aligned} T_{\max} \;&\le\; \text{KV-aware bound} \;\le\; \text{memory-power bound} \;\le\; \text{large-memory limit} \\ T_{\max} \;&\le\; \max_{b \in \mathcal{B}} \frac{R}{\dfrac{W_{\mathrm{batch}}(b)}{b\rho} + K_{\mathrm{read}}(L_{\mathrm{read}})} \\ \;&\le\; \rho\,\frac{D}{K_{\mathrm{alloc}}(L_{\mathrm{alloc}})\,W_{\mathrm{active}}} \left(1 - \frac{W_{\mathrm{resident}} + O}{C}\right) \\ \;&\le\; \rho\,\frac{D}{K_{\mathrm{alloc}}(L_{\mathrm{alloc}})\,W_{\mathrm{active}}} \end{aligned}

The gap across these terms is the point of the whole derivation. The KV-aware line is the tight, practical bound. The memory-power line shows the memory system’s large theoretical capacity-bandwidth product, and the distance between them comes from private context traffic, expert diversity, and the per-session floor. The final term drops the resident-model factor as well, so the right-hand side is exactly the simplified $D/(KW)$ from the memory-power decode bound, now with $K = K_{\mathrm{alloc}}(L_{\mathrm{alloc}})$ and $W = W_{\mathrm{active}}$ . It is the loosest, most optimistic reading, since the resident model and overhead always claim a real share of $C$ .

Single session

Set $b = 1$ to recover the latency-style bound for one conversation. There is no batch to amortize the weight sweep over.

T_{1,\max} \le \frac{R}{W_{\mathrm{batch}}(1)/\rho + K_{\mathrm{read}}(L_{\mathrm{read}})},

and for ordinary decoding ( $\rho = 1$ ) this is just $R / (W_{\mathrm{batch}}(1) + K_{\mathrm{read}}(L_{\mathrm{read}}))$ . Capacity has dropped out except as the gate that decides whether the model fits at all, consistent with the earlier observation that bandwidth governs single-session speed.

Speculative decoding

Speculative decoding lets $\rho > 1$ . A draft model proposes several tokens and the target verifies them in one iteration, so $\rho$ is the expected number of accepted tokens per session per verification step, bounded by the draft length $\gamma$ as $1 \le \rho \le \gamma + 1$ . The temptation is to multiply throughput by $\rho$ and stop. That is wrong, because the draft model and verification are not free. Their traffic belongs in $W_{\mathrm{batch}}(b)$ or in the per-token term. The safe rule is to never scale by $\rho$ without charging the draft cost in the denominator. With both effects included, speculative decoding moves through the same KV-aware formula unchanged.
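
To make the safe rule concrete, the sketch below compares three numbers for one session: the tempting shortcut, the formula with $\rho$ applied but the draft ignored, and the formula with draft traffic charged. Everything here is an assumption for illustration: an acceptance of $\rho = 2.5$, a draft model of 0.2 GB swept four times per iteration, and the `draft_traffic` parameter, which is one simple way of charging the cost inside the denominator.

```python
def per_session_rate(R, w_batch, k_read, b=1, rho=1.0, draft_traffic=0.0):
    """rho * R / (W_batch + draft traffic + b * rho * K_read), traffic in GB, R in GB/s."""
    return rho * R / (w_batch + draft_traffic + b * rho * k_read)

R, W, K = 273.0, 2.0, 0.5                       # hypothetical machine and model
baseline = per_session_rate(R, W, K)            # ordinary decoding, rho = 1
naive = 2.5 * baseline                          # multiply by rho and stop
uncharged = per_session_rate(R, W, K, rho=2.5)  # rho applied, draft cost ignored
charged = per_session_rate(R, W, K, rho=2.5, draft_traffic=4 * 0.2)  # 4 draft passes of 0.2 GB

print(round(naive, 1), round(uncharged, 1), round(charged, 1))
```

The ordering is the point. Each step of honest accounting lowers the ceiling, and only the last number is safe to report.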

The two bounds side by side

The gap between the memory-power bound and the KV-aware bound is easiest to see as memory traffic. Below, two copies of the same machine decode side by side. Each board is the machine’s usable memory, with the weights packed into an orange container of equal-sized cells and each session’s blue KV cells packed into a small container of its own.

Each decode iteration must move every byte that its accounting charges, at the same bandwidth on both machines, so the charged cells light up one by one and the board resets when the iteration completes. The left board charges only the shared weight cells, so it resets quickly and its token counter races ahead. The right board also charges the read cells of every session’s context, so its iterations stretch as the batch and the context grow. A real iteration takes milliseconds, so time runs in slow motion here.

Interactive animation: memory-power accounting and KV-aware accounting decoding side by side on the same machine.

Both boards run on the same silicon at the same bandwidth, and only the bookkeeping differs between them. The left counter is the memory-power accounting at the chosen batch, and the right one is what the KV-aware bound admits once private context reads are charged.

The sliders show the two context lengths at work. The reserved context sets how many cells each session’s container holds and can push the machine past its capacity, so raising it eventually makes the containers stop fitting. The read context sets how many of those cells light up every iteration, and the longer it gets the smaller the weights’ share of each iteration becomes. It follows the reservation at an adjustable fraction, 90 percent by default, with the unread rest capped at 32k, and the sliders for all of this sit under Advanced. Growing the model itself slows both boards down in step, while the orange container eats the room the blue ones need.

The full memory-power bound goes one step further than the left board. It grows the batch until memory is completely full of KV cells, which is exactly what the default maximize batch mode does, and the line under the left board reports that number. Shrinking the reserved context makes it explode.

At a 4k reservation about a hundred sessions fit and the bound climbs past 4,000 tok/s on this toy machine, and at a 1k reservation it would pass 17,000. Those numbers are true ceilings and useless forecasts at the same time. What stops a real machine long before then is reading each session’s context, which is exactly the traffic the right board charges.

Model adapters

The KV-aware bound is architecture-agnostic, and an architecture enters only through the five quantities. So each model family is captured by a small adapter that supplies them.

\mathrm{Adapter}(M) = \big[\, W_{\mathrm{resident}},\ W_{\mathrm{batch}}(b),\ K_{\mathrm{alloc}}(L_{\mathrm{alloc}}),\ K_{\mathrm{read}}(L_{\mathrm{read}}),\ \rho \,\big].

Three adapters cover the catalog. They handle dense transformers, mixture-of-experts, and hybrid/sliding/recurrent attention.

Dense transformers

For a dense model with $P_{\mathrm{total}}$ parameters at $e_w$ bytes each, all weights are touched every step, so the resident footprint and the per-step sweep coincide.

W_{\mathrm{resident}} = W_{\mathrm{batch}}(b) = P_{\mathrm{total}}\, e_w .

With $N_{\mathrm{layers}}$ layers, $N_{\mathrm{kv}}$ key/value heads, head dimension $d_h$ , and KV byte widths $e_{\mathrm{KV,store}}$ and $e_{\mathrm{KV,read}}$ , full-context attention reserves and reads

K_{\mathrm{alloc}}(L) = 2\, N_{\mathrm{layers}}\, N_{\mathrm{kv}}\, d_h\, e_{\mathrm{KV,store}}\, L, \qquad K_{\mathrm{read}}(L) \approx 2\, N_{\mathrm{layers}}\, N_{\mathrm{kv}}\, d_h\, e_{\mathrm{KV,read}}\, L,

where the factor $2$ counts keys and values. Weight precision and KV precision are independent settings. NVFP4 weights (NVIDIA’s 4-bit floating-point format) do not imply an NVFP4 cache, so $e_w$ and $e_{\mathrm{KV}}$ are tracked separately.
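
A short sketch of the dense adapter, with weight and KV precision kept as separate arguments. The model shape is hypothetical (8B parameters, 32 layers, 8 KV heads, head dimension 128), as are the 0.5 bytes per weight and the 2-byte KV cache.

```python
def kv_bytes(n_layers, n_kv_heads, head_dim, bytes_per_elem, context_len):
    """2 * N_layers * N_kv * d_h * e_KV * L in bytes; the 2 counts keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len

def dense_weight_bytes(p_total, e_w):
    """W_resident = W_batch(b) = P_total * e_w for a dense model."""
    return p_total * e_w

# Hypothetical dense model: 4-bit-style weights, 16-bit KV cache.
weights = dense_weight_bytes(8e9, 0.5)
k_alloc = kv_bytes(32, 8, 128, 2, 100_000)  # reserved context
k_read = kv_bytes(32, 8, 128, 2, 32_000)    # read context
print(weights / 1e9, k_alloc / 1e9, k_read / 1e9)
```

In this example the reserved cache for a single long session is larger than the weights themselves, which is why the two precisions need separate knobs.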

Mixture-of-experts

An MoE model is where the constant- $W_{\mathrm{batch}}$ assumption breaks, and fixing it is the single most important adapter correction. Let $P_{\mathrm{total}}$ be total parameters, $P_{\mathrm{active}}$ the active parameters per token, $E$ the number of routed experts, and $k$ the experts selected per token. Assuming uniformly sized routed experts, each routed expert holds

p_{\mathrm{expert}} = \frac{P_{\mathrm{total}} - P_{\mathrm{active}}}{E - k},

and the always-on remainder (dense trunk, shared experts, embeddings, attention) is

P_{\mathrm{fixed}} = P_{\mathrm{active}} - k\, p_{\mathrm{expert}} .

The naive model assumes a batch touches the same active experts every session, keeping $W_{\mathrm{batch}}$ constant. That is false. Independent sessions route to different experts, so a larger batch touches more distinct experts. With $b\rho$ token-routings, each independently missing a given expert with probability $1 - k/E$ , the expected number of distinct experts touched is

m(b\rho) = E\left(1 - \left(1 - \frac{k}{E}\right)^{b\rho}\right),

and the per-iteration shared traffic is the fixed part plus the touched experts.

W_{\mathrm{batch}}(b) = e_w\big[\, P_{\mathrm{fixed}} + p_{\mathrm{expert}}\, m(b\rho) \,\big].

At $b\rho = 1$ this reduces to the active-parameter footprint, and as $b\rho \to \infty$ it saturates at all $E$ experts. This rising $W_{\mathrm{batch}}(b)$ is why MoE batching does not amortize for free, and why the MoE rows in the worked table reach their throughput optimum at modest batch sizes.

One caveat applies here. Every other traffic term in the bound is a deliberate under-estimate of real traffic, which is what makes $R/q$ a true ceiling. The expert count $m(b\rho)$ is different. It is an expectation under independent, uniform routing rather than a lower bound. Real routing is correlated, since load-balancing losses push toward uniform while hot experts and topically similar sessions pull the other way, and correlated routing touches fewer distinct experts than the formula predicts. In that case the modeled traffic overstates the actual traffic, and the computed ceiling can sit below the true one. When a guaranteed ceiling is required, replace $m(b\rho)$ by its minimum $k$ , which replaces $W_{\mathrm{batch}}(b)$ by $W_{\mathrm{active}}$ . The expectation form is the better estimate, the floor form is the safe bound.
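
The MoE adapter can be sketched as a small factory that returns $W_{\mathrm{batch}}(b)$, with a switch between the expectation form and the floor form. The parameter counts are the Qwen3.6-35B-A3B card values quoted in this post. The 0.5 bytes per weight is my assumption for NVFP4 and ignores scale-factor overhead, and the function and argument names are illustrative.

```python
def moe_w_batch(p_total, p_active, n_experts, top_k, e_w):
    """Build W_batch(b) in bytes for an MoE with uniformly sized routed experts."""
    p_expert = (p_total - p_active) / (n_experts - top_k)
    p_fixed = p_active - top_k * p_expert

    def w_batch(b, rho=1.0, guaranteed=False):
        if guaranteed:
            touched = top_k  # floor form: the safe bound, equal to W_active
        else:
            miss = (1 - top_k / n_experts) ** (b * rho)
            touched = n_experts * (1 - miss)  # expectation under independent uniform routing
        return e_w * (p_fixed + p_expert * touched)

    return w_batch

w_batch = moe_w_batch(35e9, 3e9, 256, 8, 0.5)
for b in (1, 4, 16, 64):
    print(b, round(w_batch(b) / 1e9, 2), "GB")
```

Passing `guaranteed=True` gives the floor form for cases where a strict ceiling matters more than a good estimate.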

Hybrid, sliding, and recurrent attention

Models with local or sliding-window attention, compressed or latent attention, or linear/recurrent state must not use the full-KV formula blindly, because their cache does not grow linearly in $L$ everywhere. Split both KV terms into global, local, and fixed-state parts.

K_{\bullet}(L) = K_{\mathrm{global},\bullet}(L) + K_{\mathrm{local},\bullet}(L) + K_{\mathrm{state},\bullet}, \qquad \bullet \in \{\mathrm{alloc}, \mathrm{read}\}.

A simple read approximation with sliding-window width $w$ is

K_{\mathrm{read}}(L) = \kappa_{\mathrm{global}}\, L + \kappa_{\mathrm{local}}\,\min(L, w) + K_{\mathrm{state}} .

Full-attention layers pay for the whole context, sliding-window layers pay only up to the window, and recurrent or latent state adds a fixed or slowly growing term. The same shape covers Gemma-style local/global attention and DeepSeek-style compressed/sparse attention, with only the coefficients changing.
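
A minimal sketch of the read approximation, compared against the full-KV formula applied blindly. The coefficients (bytes per token for global and local layers) and the 1,024-token window are hypothetical, not taken from any model card.

```python
def hybrid_k_read(L, kappa_global, kappa_local, window, k_state=0):
    """K_read(L) = kappa_global * L + kappa_local * min(L, w) + K_state, in bytes."""
    return kappa_global * L + kappa_local * min(L, window) + k_state

def full_k_read(L, kappa_global, kappa_local):
    """What the full-KV formula would charge if every layer read the whole context."""
    return (kappa_global + kappa_local) * L

KG, KL, W = 16_384, 81_920, 1024  # hypothetical bytes/token and window width
for L in (512, 32_000):
    print(L, hybrid_k_read(L, KG, KL, W), full_k_read(L, KG, KL))
```

Below the window the two agree. Past it, the local term stops growing and the hybrid read falls well under the full-KV figure.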

Calculator procedure

The bound is now ready to compute. Because the same computation runs for every hardware-and-model pair, it is worth stating once as a procedure.

The inputs are a hardware row $(C, R, O)$ , a model adapter $(W_{\mathrm{resident}}, W_{\mathrm{batch}}(\cdot), K_{\mathrm{alloc}}(\cdot), K_{\mathrm{read}}(\cdot), \rho)$ , and workload assumptions $(L_{\mathrm{alloc}}, L_{\mathrm{read}}, r_\star)$ .

Compute the resident margin $C - W_{\mathrm{resident}} - O$ . If it is negative, the model does not fit, so stop.
Compute $K_{\mathrm{alloc}}(L_{\mathrm{alloc}})$ and the memory-fit batch $b_{\mathrm{mem}}$ .
For each integer batch $1 \le b \le b_{\mathrm{mem}}$ , compute $W_{\mathrm{batch}}(b)$ and $q_{\mathrm{KV}}(b, L_{\mathrm{read}})$ .
Compute the aggregate ceiling $R / q_{\mathrm{KV}}$ and the per-session ceiling $R / (b\, q_{\mathrm{KV}})$ at each batch.
Keep the batches whose per-session ceiling is at least $r_\star$ .
Among the kept batches, choose the one with the largest aggregate ceiling, and report it as the KV-aware result with its batch as $b^\star$ .
Separately compute the memory-power ceiling for orientation.

The output is a stack of gates, and the right phrasing depends on which gate bound.

State	Meaning
Resident fit	The model plus overhead fits in memory
Session fit	At least one reserved-context session fits
Floor fit	Some fitting batch clears $r_\star$
No floor	Sessions fit, but no batch clears $r_\star$

The common invalid reading is that fitting in memory implies serving usefully. A model can pass resident fit and session fit and still have an empty usable batch set, because every fitting batch is below the floor. The honest report for that case is “fits, but no batch satisfies the floor”, a distinct verdict from a true fit failure. Keeping the two apart is the reason the floor gate exists.
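
The gate stack can be sketched as a function that returns the first verdict that applies. The verdict strings and the function name are my own labels for the states in the table, and the sample inputs are hypothetical.

```python
import math

def gate_verdict(C, R, O, w_resident, w_batch, k_alloc, k_read, r_star, rho=1.0):
    """Walk the gates in order and report which one decides the outcome."""
    margin = C - w_resident - O
    if margin < 0:
        return "resident fit failed"
    b_mem = math.floor(margin / k_alloc)
    if b_mem < 1:
        return "session fit failed"
    for b in range(1, b_mem + 1):
        per_session = rho * R / (w_batch(b) + b * rho * k_read)
        if per_session >= r_star:
            return "floor fit"
    return "no floor"  # fits, but no batch satisfies the floor

# Hypothetical dense 20 GB model on a 128 GB, 273 GB/s machine with a 20 tok/s floor.
print(gate_verdict(128, 273, 8, 20, lambda b: 20.0, 1.0, 0.5, 20))
```

The example model leaves room for a hundred sessions and still lands on "no floor", because even a single session has to sweep 20 GB per token.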

Worked examples

Now check the theory against real hardware. Consider two 128 GB machines, one bandwidth-rich and one bandwidth-poor, which isolate the effect of $R$ at fixed $C$ .

Two machines

NVIDIA’s DGX Spark, a small desktop AI machine, carries 128 GB of LPDDR5x unified memory at 273 GB/s. An Apple M5 Max with a 40-core GPU reaches 614 GB/s and is configurable to 128 GB of unified memory. At equal capacity their memory powers are

Hardware	$C$	$R$	$D = CR$
DGX Spark	128 GB	273 GB/s	34,944 $\mathrm{GB}^2/\mathrm{s}$
Apple M5 Max 128GB	128 GB	614 GB/s	78,592 $\mathrm{GB}^2/\mathrm{s}$

Both bandwidth numbers are catalog figures rather than measured sustained rates, so the earlier caveat applies. If one machine sustains a larger share of its spec than the other, the comparison will make the other machine look better than it really is.

Three models

The examples use three MoE models, modeled from their published cards as adapter parameters, without re-measurement.

Qwen3.6-35B-A3B, with 35B total / 3B active parameters, $E = 256$ routed experts, $k = 8$ routed (plus one shared) per token, and weights quantized NVFP4.
Gemma 4 26B-A4B-it, with 26B total / 4B active, $E = 128$ , top-8 routing, hybrid local/global attention, and NVFP4 weights.
DeepSeek V4 Flash (DS4), with 284B total / 13B active, $E = 256$ routed plus one shared, $k = 6$ per token, million-token context via compressed/sparse attention, and weights at a Q2-style mixed quantization.

All rows use the Local Frontier defaults, namely reserved context $L_{\mathrm{alloc}} = 100{,}000$ , active context $L_{\mathrm{read}} = 32{,}000$ , per-session floor $r_\star = 20$ tok/s/session, ordinary decoding $\rho = 1$ , runtime overhead $O = 8$ GB, and the memory roofline only. The numbers are memory-side upper bounds from the simplified adapters, to be read as ceilings for comparing hardware.

Results

Hardware	Model	Single-session	$b^\star$	KV-aware aggregate	Memory-power ceiling
DGX Spark	Qwen3.6-35B-A3B	$\le$ 149 tok/s	17	$\le$ 345 tok/s	$\le$ 18.2k tok/s
DGX Spark	Gemma 4 26B-A4B-it	$\le$ 120 tok/s	16	$\le$ 333 tok/s	$\le$ 23.7k tok/s
DGX Spark	DeepSeek V4 Flash (Q2)	$\le$ 83 tok/s	7	$\le$ 154 tok/s	$\le$ 13.1k tok/s
Apple M5 Max 128GB	Qwen3.6-35B-A3B	$\le$ 336 tok/s	50	$\le$ 1,006 tok/s	$\le$ 41.0k tok/s
Apple M5 Max 128GB	Gemma 4 26B-A4B-it	$\le$ 271 tok/s	66	$\le$ 1,326 tok/s	$\le$ 53.3k tok/s
Apple M5 Max 128GB	DeepSeek V4 Flash (Q2)	$\le$ 188 tok/s	22	$\le$ 446 tok/s	$\le$ 29.5k tok/s

Two things stand out. First, with capacity held equal, the higher M5 Max bandwidth lifts the single-session ceilings in proportion to $R$ , and the batched ceilings by even more, because the extra bandwidth also lets more sessions clear the per-session floor. The bandwidth-rich machine wins exactly where the theory says it should, in batched throughput. Second, the memory-power column sits one to two orders of magnitude above the KV-aware column. That gap is the cost of private context traffic and expert diversity, and showing it is the point of the derivation.

Forced concurrency

What if concurrency is fixed by policy rather than chosen at the floor-satisfying optimum? On DGX Spark, pushing past $b^\star$ buys aggregate throughput at the cost of per-session rate.

Model	Batch $b$	Aggregate ceiling	Per-session ceiling
Qwen3.6-35B-A3B	32	$\le$ 394 tok/s	$\le$ 12.3 tok/s/session
Qwen3.6-35B-A3B	64	$\le$ 478 tok/s	$\le$ 7.5 tok/s/session
Gemma 4 26B-A4B-it	32	$\le$ 430 tok/s	$\le$ 13.4 tok/s/session
Gemma 4 26B-A4B-it	64	$\le$ 575 tok/s	$\le$ 9.0 tok/s/session

This is the serving tradeoff in numbers. For DGX Spark under these assumptions, 32 and 64 sessions are too high if the goal is around 20 tok/s/session, exactly the regime the usable-batch correction is built to reject, and the reason $b^\star$ for these models settles near 16.
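
The per-session column is just the aggregate divided by the batch, which a few lines can confirm against the floor. The aggregate figures are copied from the table. The helper name is illustrative.

```python
FLOOR = 20.0  # r_star from the Local Frontier defaults, tok/s/session

# (model, batch, aggregate ceiling in tok/s) as listed in the forced-concurrency table
rows = [
    ("Qwen3.6-35B-A3B", 32, 394),
    ("Qwen3.6-35B-A3B", 64, 478),
    ("Gemma 4 26B-A4B-it", 32, 430),
    ("Gemma 4 26B-A4B-it", 64, 575),
]

def per_session(aggregate, batch):
    """Per-session ceiling r(b) = T(b) / b."""
    return aggregate / batch

for model, b, aggregate in rows:
    r = per_session(aggregate, b)
    print(f"{model} b={b}: {r:.1f} tok/s/session, clears floor: {r >= FLOOR}")
```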

A sanity check against a real run

A reported DGX Spark run served Gemma at concurrency 16 at roughly 16 to 18 tok/s/session, an aggregate of $16 \times 16 = 256$ to $16 \times 18 = 288$ tok/s. The KV-aware aggregate ceiling for Gemma at this batch is $\le 333$ tok/s, so observed throughput is

\frac{256}{333} \approx 77\% \quad\text{to}\quad \frac{288}{333} \approx 86\%

of the simplified ceiling. That is close enough to suggest the implementation is near the memory-side roofline. It does not prove the implementation is optimal. The bound omits compute, scheduler behavior, kernel details, and exact cache traffic, and proving optimality would require profiler evidence of bandwidth saturation with no compute, scheduler, or CPU stalls. A bound this close simply means there is little memory headroom left to capture.

Omitted rooflines

The memory roofline is one limit among several. Real throughput is the minimum over all of them,

T_{\max} \le \min\big(T_{\mathrm{memory}},\ T_{\mathrm{compute}},\ T_{\mathrm{kernel}},\ T_{\mathrm{scheduler}},\ T_{\mathrm{interconnect}}\big),

and the memory term we computed can be undercut by compute throughput and tensor-core utilization, dequantization kernels, attention kernels and KV layout, prefill/decode phase mixing, scheduler overhead and request churn, CPU and PCIe involvement, multi-GPU communication, allocator fragmentation, thermal and power limits, tokenization and sampling, speculative rejection rates, and prefix-cache hit rates. Each belongs as its own limit term. The memory model remains useful because it makes the first unavoidable ceiling explicit and cheap to compute, and because for memory-bound decode it is usually the binding one.

One recurring caution applies to models with recurrent or linear state. A model with tiny fixed state and tiny private read traffic produces an enormous memory-side aggregate at high concurrency, because almost nothing in the denominator grows with the batch. That is precisely the signal that compute, kernel, scheduler, and recurrent-state details must be added before the aggregate number is treated as realistic. The memory bound describes the best case the hardware allows, and reaching it is the implementation’s job.

Cheat sheet

This section collects the whole formulation in one place, so it can be read on its own. A machine is three numbers and a model adapter is five, with the workload adding three assumptions.

Symbol	Meaning
$C$ , $R$ , $O$	Usable memory capacity, sustained memory bandwidth, runtime overhead
$W_{\mathrm{resident}}$	Full resident weight footprint, which must fit in memory
$W_{\mathrm{batch}}(b)$	Shared weight traffic per iteration, equal to $W_{\mathrm{resident}}$ for dense models and growing with $b$ for MoE
$K_{\mathrm{alloc}}(L_{\mathrm{alloc}})$	KV memory reserved per session
$K_{\mathrm{read}}(L_{\mathrm{read}})$	Private context traffic per output token
$\rho$	Tokens emitted per session per iteration, one for ordinary decoding
$L_{\mathrm{alloc}}$ , $L_{\mathrm{read}}$	Reserved and average read context, $L_{\mathrm{read}} \le L_{\mathrm{alloc}}$
$r_\star$	Minimum useful tokens/s per session

Everything descends from the memory roofline. If each output token must move at least $q$ bytes and the machine delivers at most $R$ bytes per second, then

T_{\max} \le \frac{R}{q} .

The first gate is whether sessions fit. Load the weights, reserve the overhead, and divide what is left by the per-session KV allocation.

b_{\mathrm{mem}}(L_{\mathrm{alloc}}) = \left\lfloor \frac{C - W_{\mathrm{resident}} - O}{K_{\mathrm{alloc}}(L_{\mathrm{alloc}})} \right\rfloor .

Per-token traffic is the shared weight sweep amortized over the batch plus the private context read, which no batch size amortizes.

q(b, L_{\mathrm{read}}) = \frac{W_{\mathrm{batch}}(b)}{b\,\rho} + K_{\mathrm{read}}(L_{\mathrm{read}}) .

The roofline gives the aggregate ceiling at batch $b$ , and dividing by the batch gives the per-session rate. The aggregate rises with $b$ while the per-session rate falls.

T(b) \le \frac{R}{q(b, L_{\mathrm{read}})}, \qquad r(b) = \frac{T(b)}{b} = \frac{\rho R}{W_{\mathrm{batch}}(b) + b\,\rho\,K_{\mathrm{read}}(L_{\mathrm{read}})} .

Imposing the floor $r_\star$ rejects the batches where every session crawls, and the usable batch is whichever gate binds first.

b_{\mathrm{rate}}(L_{\mathrm{read}}, r_\star) = \left\lfloor \frac{\rho R / r_\star - W_{\mathrm{active}}}{\rho\,K_{\mathrm{read}}(L_{\mathrm{read}})} \right\rfloor, \qquad b_{\mathrm{usable}} = \min\big(b_{\mathrm{mem}},\ b_{\mathrm{rate}}\big) .

The usable batch set holds the batches that fit and clear the floor, and the KV-aware bound is the best aggregate over it.

\mathcal{B} = \left\{ b : 1 \le b \le b_{\mathrm{mem}},\ \frac{R}{b\, q(b, L_{\mathrm{read}})} \ge r_\star \right\}, \qquad T_{\max} \le \max_{b \in \mathcal{B}} \frac{R}{q(b, L_{\mathrm{read}})} .

Setting $b = 1$ in the same formula gives the single-session ceiling. Dropping the private context term and taking the largest fitting batch gives the looser memory-power ceiling,

T_{\max} \le \rho\,\frac{D}{K_{\mathrm{alloc}}(L_{\mathrm{alloc}})\,W_{\mathrm{active}}}\left(1 - \frac{W_{\mathrm{resident}} + O}{C}\right), \qquad D = CR ,

and the three levels always order the same way.

\text{actual throughput} \;\le\; \text{KV-aware bound} \;\le\; \text{memory-power bound} .

Every number these formulas produce is an upper bound built from memory capacity and bandwidth alone. Real implementations land below it, and compute, software overhead, and interconnects can only lower the ceiling further.

When reporting a result, always state the assumptions that move it, namely $L_{\mathrm{alloc}}$ , $L_{\mathrm{read}}$ , $\rho$ , $r_\star$ , the weight precision, and the KV-cache precision or attention adapter. Without them, a single tok/s number is not reproducible.

To see these bounds computed live for hundreds of audited model profiles against a catalog of local hardware, try the Local Frontier calculator.

@onusoz · /2026/07/13 · 06:02 PM

I'm tired of AI writing 👃 smell 👃 here on X (and also the docs I generate) so I made a skill to 👃 de-smell 👃 Steal the skill, especially those who use GPT 5 to generate their long-form posts, for example 🤗 @analogalok @sudoingX @aijoey @ishaansehgal @stevibe and even @TheAhmadOsman 🤗 Surprisingly, @levelsio, a.k.a. AI reply guys' final boss, is the most *human* poster in the "sentence flow" metric that fable came up with in an autoresearch loop. Very cool (though I haven't sampled everyone, just long-formish writers my xtap extension has picked up) I know I'm doing a favor to slop vendors out there. But there are people who are actually doing legit work, but then passing all their thoughts through GPT before putting it out here. I *have* to follow them due to my occupation. If this will save me from another "it's not X, it's Y", or "it is A, B and C", I will do it 😤 My de-smelled slop post has all the info, graphs and all: The skill (will keep updating it): This is not the end of this work, I am just beginning

@onusoz · /2026/07/13 · 12:39 PM

don't tease us 😩

@appltrack · Jul 12, 2026

Apple's M7 Ultra chip coming in 2029 is rumored to support 1.5TB of RAM. This would make the processor much more capable for on-device AI.

@onusoz · /2026/07/13 · 08:15 AM

I've been using Cursor for a few days now, and I have to say the experience is really pleasant. I had not touched it since Claude Code came out in May 2025, more than 1 year ago. I had even deleted it fully last year as an editor and replaced it with Zed, after it had become very bloated and unstable. I can say it no longer feels bloated/unstable and that they have transitioned successfully to the agentic era. Using it mostly for fable. Using it locally in the desktop app, and also through acpx, to make codex ask fable for review. Thank you @cursor_ai for sponsoring my Ultra plan!

@onusoz · Jul 13, 2026

I told it only once before to do it on the session, and now codex habitually asks fable for review on data models and plans it creates through acpx (fable inside cursor) I then asked it what it thinks about the feedback from fable. gpt-5.6-sol is not impressed

@onusoz · /2026/07/13 · 07:45 AM

I told it only once before to do it on the session, and now codex habitually asks fable for review on data models and plans it creates through acpx (fable inside cursor) I then asked it what it thinks about the feedback from fable. gpt-5.6-sol is not impressed

@onusoz · /2026/07/13 · 07:15 AM

Reminder that this fever dream of a podcast exists with @dwarkesh_sp and @_sholtodouglas youtube.com/watch?v=3XDad4…

@onusoz · /2026/07/13 · 07:11 AM

new satisfaction unlocked: wake up to an agent run that finishes seconds after you open the laptop

Onur Solmaz · Post · /2026/07/13

Building an AI de-smeller

Note: this post is fully AI generated, written by Claude Fable 5 through Cursor. It was written interactively, in a back and forth with an agent, under the very kill-ai-smell skill it describes, as a demonstration of that skill.

Everyone knows the feeling by now. You open a page, read two paragraphs, and something tells you a model wrote it. The tells have become cultural shorthand, with the em dash as the poster child, but most of the discourse stays at the level of vibes. I wanted to know whether the feeling corresponds to anything you can measure, so my agent and I ran a small stylometric study, and the answer turned out to be a clear yes. A handful of surface metrics, all computable with regular expressions and a sentence splitter, separate generated copy from human writing by an order of magnitude, in several cases with no overlap between the groups at all.

This post walks through the corpus, the metrics, and the numbers, and ends with the kill-ai-smell skill that came out of the exercise.

The corpus

For the AI side I needed pages that read as generated in the wild, and I had convenient specimens close to home. Ten project sites from the OpenClaw ecosystem (crabbox.sh, mcporter.sh, gitcrawl.sh, clawpatch.ai, fs-safe.io, spogo.sh, imsg.sh, wacli.sh, gogcli.sh and goplaces.sh) have landing copy written largely by agents, and I decided to keep those pages as they are and use them as data. Their prose comes to 4,853 words after stripping code blocks. These pages were written by GPT 5.5, so the measurements characterize that model’s copy. They do not necessarily generalize to other models, whether Claude or open-weight ones like GLM and Kimi.

For the human side I wanted texts that are provably human, which means they had to be frozen before language models could have touched them. Five are essays and documentation: the SQLite testing documentation, Joel Spolsky’s 2000 essay “Things You Should Never Do”, a 2018 antirez blog post, Paul Graham’s 2009 “Maker’s Schedule, Manager’s Schedule”, and Julia Evans’ 2019 “Get your work recognized: write a brag document”. The other three attack the register objection directly, because landing copy and essays are different animals regardless of author. They are the ripgrep README at its 2016 tag, the Redis README at 3.2.0 from the same year, and the Requests README at v2.13.0 from 2017, each taken at an old git tag whose commit history proves its date. A README that sells a tool is the fairest comparison for a landing page that sells a tool. The human set comes to 15,317 words, and every rate below is normalized per 1,000 words so the corpora are comparable.

The exact texts I measured are archived in the ai-smell repository and listed in the appendix, so anyone can rerun the numbers against the same input.

The metrics

The measuring script strips code, splits sentences, and counts things. There is no model in the loop and no judgment call in any metric. That is the point of the exercise. If the tells are real, they should be detectable by grep, and anyone should be able to reproduce the numbers. The scripts, the corpus, and the figures live in the ai-smell repository.

The chart below shows every document against every metric, with the AI pages in orange and the human baselines in blue. The prose after it sticks to the ratios, because the ratios are the story, and the raw ranges are collapsed underneath for anyone who wants to check them.

Raw ranges per metric

Metric	AI set (10 docs)	Human set (8 docs)
Em dashes /1k words	0.0–61.3	0.0–4.7
Exactly-three lists /1k words	6.3–15.9	0.0–2.0
Labeled bullets, % of all bullets	53–100%	0–11%
Fragment sentences (≤4 words)	3.6–41.9%	1.4–17.4%
First person /1k words	0.0–2.1	0.0–50.9
Type-token ratio (first 280 words)	0.59–0.69	0.53–0.67
MTLD lexical diversity	112–229	66–146
Mean Zipf word frequency	4.86–5.30	5.28–5.99
Sentence flow (mean run percentile)	0.19–0.41	0.49–0.73

The em-dash gap is the famous one, and it is real but weaker than its reputation. Averaged over each corpus, the AI pages use em dashes at roughly eighteen times the human rate, and the heaviest page lands one every sixteen words. But the tell only works in one direction. Three of the ten AI pages use fewer em dashes than the 2016 ripgrep README, and one uses none at all. A page drowning in dashes is almost certainly generated. A page without them proves nothing.
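
For a sense of how little machinery the metric needs, here is a minimal sketch of an em-dash rate. It is not the script from the ai-smell repository. The word pattern and the function name are my assumptions, and the last line only re-derives the "one every sixteen words" figure from the 61.3 per 1k maximum in the table.

```python
import re

EM_DASH = "\u2014"

def em_dashes_per_1k(text):
    """Em dashes per 1,000 words, counting runs of letters, digits, and apostrophes as words."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    return 1000 * text.count(EM_DASH) / len(words) if words else 0.0

sample = "a b c d e\u2014f g h i j"  # ten words, one em dash
print(em_dashes_per_1k(sample))
print(1000 / 61.3)  # words per em dash on the heaviest AI page
```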

Exactly-three lists (“A, B, and C”) turn out to be the better punctuation-level tell. Every AI page produces them at more than three times the rate of any human text, with no overlap anywhere. Even the most triad-prone human text, Joel’s essay, sits at a third of the most restrained AI page. Averaged over the corpora the gap is about nineteen-fold, and unlike the em dash, no AI page escapes it.
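One crude way to count such lists, assuming the Oxford-comma form shown above. The heuristic is mine for illustration: a sentence counts when its third comma-separated piece opens with "and" or "or". It misses lists written without the final comma and will miscount some compound sentences, so the study's script is likely stricter.

```python
import re

LEADS_WITH_CONJUNCTION = re.compile(r"(?:and|or)\b", re.IGNORECASE)


def is_exactly_three(sentence: str) -> bool:
    """True when the closing 'and'/'or' item is the third comma piece."""
    pieces = [p.strip() for p in sentence.split(",")]
    return len(pieces) >= 3 and LEADS_WITH_CONJUNCTION.match(pieces[2]) is not None


def triads_per_thousand(sentences: list[str]) -> float:
    """Exactly-three lists per 1,000 words over a list of sentences."""
    words = sum(len(s.split()) for s in sentences)
    hits = sum(is_exactly_three(s) for s in sentences)
    return 1000.0 * hits / words if words else 0.0
```

A four-item list such as "A, B, C, and D" is skipped because its third piece is a plain item and the conjunction only arrives in the fourth.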

The labeled bullet was the discovery of the study for me. It is the bullet that opens with a short label, then a period, colon, or dash, then one sentence of elaboration. Here is a real one from mcporter.sh:

Typed clients. mcporter emit-ts emits .d.ts interfaces or a ready-to-run client wrapping createServerProxy() so agents call MCP tools with full TypeScript types.

The metric is the share of a document’s bullets that follow this shape. On the AI pages it is roughly four of every five, and on one page every single bullet does it. In the human baselines the share never reaches one in eight, and five of the eight human texts never use the shape at all; their bullets are plain items, like file names or flags, without the label-and-elaboration mold. If you have seen an AI-written landing page, you have seen walls of these, and it turns out the wall is more diagnostic than the punctuation.
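The shape can be approximated with one regular expression. This is a sketch under assumptions of my own: a label is at most four words, and a dash only counts as a separator when it has whitespace on both sides. The real script may draw these lines differently.

```python
import re

LABELED = re.compile(
    r"(?P<label>[^.:\n]+?)"                # shortest candidate label
    r"(?:[.:]\s+|\s+[-\u2013\u2014]\s+)"   # period, colon, or spaced dash
    r"(?=\S)"                              # some elaboration must follow
)
MAX_LABEL_WORDS = 4  # assumed cutoff for a "short" label


def is_labeled(bullet: str) -> bool:
    """True for 'Label. Elaboration' style bullets."""
    m = LABELED.match(bullet.strip())
    return m is not None and len(m.group("label").split()) <= MAX_LABEL_WORDS


def labeled_share(bullets: list[str]) -> float:
    """Fraction of bullets that follow the labeled shape."""
    return sum(is_labeled(b) for b in bullets) / len(bullets) if bullets else 0.0
```

Plain items like `README.md` or `--verbose` fall through, because nothing in them is a separator followed by whitespace and more text.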

Fragment sentences, the verbless punches of four words or fewer, sit in between. Most AI pages run high, and the worst writes two of every five sentences that way, but the groups overlap. The deliberately punchy Requests README out-fragments three of the AI pages. Fragments corroborate rather than convict.
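As a sketch, the fragment share reduces to a length check. The version below only applies the four-word cutoff from the table and does not test for a missing verb, which is a simplification on my part.

```python
def fragment_share(sentences: list[str], max_words: int = 4) -> float:
    """Fraction of sentences with at most max_words words."""
    if not sentences:
        return 0.0
    short = sum(1 for s in sentences if len(s.split()) <= max_words)
    return short / len(sentences)
```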

First person taught me the opposite lesson. It looked like a strong tell until the corpus got fairer. Against essays the gap is enormous, since the antirez post averages more than one first-person word per sentence and the ten AI pages together contain exactly one. But the pre-LLM READMEs behave like the AI pages here. The Requests README has no first person at all, and ripgrep and Redis barely any. Authorial voice turns out to track register rather than authorship, so it works as a confirming signal at best. Vocabulary variety weakened the same way. Generated copy rotates in a fresh synonym at every mention, which pushes its lexical diversity high, and the AI pages do cluster at the top of the range. But the Requests README, which is deliberately punchy marketing prose, scores right among them, so the metric separates registers more than it separates authors. The human tendency it gestures at is still real, though. Human writers reuse the established word for a thing and repeat phrases for emphasis. Joel opens three consecutive paragraphs with “You are throwing away”, and a model would never.

The vocabulary story does not end there, because the type-token ratio was the wrong instrument. We remeasured word choice with two better ones. MTLD scores lexical diversity in a way that does not depend on document length, and the wordfreq package places every word on the Zipf frequency scale, where “the” scores about 7.7 and anything under 3.0 sits outside roughly the 30,000 commonest English words.
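For readers who have not met either instrument, here is a compact sketch. The MTLD routine follows the commonly published description (a factor closes each time the running type-token ratio falls to 0.72, and the forward and backward passes are averaged); the threshold and tokenization used in the study are assumptions here. The Zipf helper only shows the scale, which is the base-10 logarithm of a word's frequency per billion words. In practice the per-word value would come from wordfreq's `zipf_frequency(word, "en")`.

```python
import math


def _mtld_one_way(tokens: list[str], threshold: float) -> float:
    """Tokens per 'factor', scanning in one direction."""
    factors = 0.0
    types: set[str] = set()
    count = 0
    for tok in tokens:
        count += 1
        types.add(tok)
        if len(types) / count <= threshold:
            factors += 1.0          # running TTR decayed: close a factor
            types, count = set(), 0
    if count:                       # partial credit for the leftover tail
        ttr = len(types) / count
        factors += (1.0 - ttr) / (1.0 - threshold)
    return len(tokens) / factors if factors else math.inf


def mtld(tokens: list[str], threshold: float = 0.72) -> float:
    """Mean of the forward and backward passes; higher means more varied."""
    words = [t.lower() for t in tokens]
    return (_mtld_one_way(words, threshold) + _mtld_one_way(words[::-1], threshold)) / 2.0


def zipf_from_per_million(per_million: float) -> float:
    """Zipf value for a word seen per_million times per million words."""
    return math.log10(per_million) + 3.0
```

On this scale a word that occurs once per million words scores 3.0, and a word that makes up about 5% of running text scores about 7.7. A text too short or too varied to close a single factor returns infinity in this sketch.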

Both separate the groups where the TTR could not. All ten AI pages score above 111 on MTLD while seven of the eight human texts stay under 106, with the Requests README as the lone crossover once again. Mean word frequency is nearly as sharp from the other side. Every AI page averages Zipf 5.30 or below and every human text 5.28 or above, so the two ranges overlap only in a sliver 0.02 wide, where the ripgrep README brushes past the plainest-worded AI page. Neither axis classifies alone, since ripgrep crosses the frequency boundary and Requests crosses the diversity one, but each README fails only one test. Draw both thresholds, at Zipf 5.35 and MTLD 100, and every AI page sits in the rare-and-rotating corner while no human text enters it. The structural detector coming up needs only one of its two lines; this pair needs both, and both suffice. The reason is not exotic vocabulary. The rare words on both sides are ordinary jargon, “OAuth” and “stdout” against “Valgrind” and “malloc”. What differs is the connective tissue. Nearly half the words in the human texts are the commonest ones in English, the “the” and “of” that full sentences run on, while on the AI pages that share drops below three in ten, because telegraphic noun piles need no articles. So the rarity metric measures, from a third angle, the same thing the fragments and the labeled bullets measure, which is that generated landing copy does not write whole sentences.
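The two-threshold rule is a conjunction, which a few lines make explicit. The function name is hypothetical; the two constants are the ones quoted above.

```python
ZIPF_LINE = 5.35   # mean Zipf below this counts as "rare"
MTLD_LINE = 100.0  # MTLD above this counts as "rotating"


def in_rare_and_rotating_corner(mean_zipf: float, mtld_score: float) -> bool:
    """Both conditions must hold; crossing one line is not enough."""
    return mean_zipf < ZIPF_LINE and mtld_score > MTLD_LINE
```

A document like the ripgrep README, which crosses only the frequency line, and one like the Requests README, which crosses only the diversity line, both come back `False`.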

Flesch-Kincaid grade, the standard readability score, misses all of it. The AI pages come out as easier reading than the human baselines because their sentences are short, and the formula never looks at what fills them. This post sits on the human side of both word-choice axes, as the green diamond in the chart shows.

Structural tells

Beyond the counters, we tested two structural claims, and both held on the larger corpus.

The first is identity deferral, which I wrote about in Good READMEs say what tools are. Generated copy describes what a tool does and dodges saying what it is. In sentences where the tool name is the grammatical subject, action claims outnumber identity claims five to one across the ten pages. The raw ratio alone proves little, since prose about a known subject is naturally verb-led. The positional version is the tell. All three pre-LLM READMEs establish identity in their opening lines: “ripgrep is a line oriented search tool”, “Requests is the only Non-GMO HTTP library for Python”, and Redis opens its first section with the literal heading “What is Redis?”. Among the ten AI pages, exactly one does the same. Four open with headless fragments like “A local-first GitHub triage tool for maintainers and agents.”, which name a category but carry no subject and no verb, and the remaining five open with benefit imperatives or setup instructions like “Keep your editor and git workflow.”

The second is heading register, and it produced a fun inversion. Title Case, the thing style guides nag about, belongs to the humans here. The SQLite docs use it in a fifth of their headings, as was the convention of their era, while the AI pages write modern sentence case throughout. What convicts the AI headings is rhetoric. About a third of the AI headings are slogans, imperatives, or rhetorical frames rather than labels; in the human set the share is one in ten, and those are mostly mild FAQ-style questions like “Why should I use ripgrep?”. The strongest single shape is what I now call the comma couplet, a parallel two-beat slogan like “Local loop, remote box”, “Two jobs, one binary” or “Small surface, clear split”. It appears eleven times across five of the ten sites. The human set produces it exactly once, and the exception is instructive. It is the title “Maker’s Schedule, Manager’s Schedule”, a deliberate one-off rather than a house pattern stamped down a page.
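A rough approximation of the comma couplet, built only from the examples quoted above: one comma, and the same small number of words on each side. Both the rule and the three-word cap are my assumptions, and a real check would also want to test that the two halves are grammatically parallel.

```python
def is_comma_couplet(heading: str, max_words: int = 3) -> bool:
    """True for two-beat headings like 'Two jobs, one binary'."""
    halves = [h.split() for h in heading.strip().rstrip(".!?").split(",")]
    return len(halves) == 2 and 1 <= len(halves[0]) == len(halves[1]) <= max_words
```

By this rule “Maker’s Schedule, Manager’s Schedule” also matches, which agrees with the single human hit reported above.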

There is also a tell that only becomes visible when you put the pages side by side. Six of the ten sites have a “Pick your path” section, seven make a “five minutes” time-to-value promise, eight have a “Status” section, and six close with the exact sentence “Released under the MIT license.” Each page looks fine alone. Together they reveal one prompt’s house style stamped across unrelated projects.

A minimal detector

The expanded corpus simplified the detector, because the two structural metrics turned out to need no help. Flag a page as AI-flavored when exactly-three lists exceed 3 per 1,000 words or the labeled-bullet share exceeds 30%. Either rule alone classifies all eighteen documents correctly. The em dash dropped out of the detector. It convicts a page when present in bulk, but three of the ten AI pages use fewer em dashes than the 2016 ripgrep README, so its absence clears nothing.
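Written out, the rule is a two-line disjunction. The sketch below uses the thresholds stated above; the function names are mine, and the inputs are assumed to be computed elsewhere.

```python
TRIADS_PER_1K_LIMIT = 3.0    # exactly-three lists per 1,000 words
LABELED_SHARE_LIMIT = 0.30   # labeled bullets as a fraction of all bullets


def tripped_rules(triads_per_1k: float, labeled_share: float) -> list[str]:
    """Names of the rules a document crosses, in a fixed order."""
    rules = []
    if triads_per_1k > TRIADS_PER_1K_LIMIT:
        rules.append("triads")
    if labeled_share > LABELED_SHARE_LIMIT:
        rules.append("labeled_bullets")
    return rules


def is_ai_flavored(triads_per_1k: float, labeled_share: float) -> bool:
    """Either rule alone is enough to flag the page."""
    return bool(tripped_rules(triads_per_1k, labeled_share))
```

Checked against the raw ranges, the lowest AI values (6.3 triads per 1,000 words, a 53% labeled share) each clear their line, and the highest human values (2.0 and 11%) clear neither.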

Plotting the two structural metrics against each other shows how much margin the thresholds have. Every human text sits in the bottom-left corner, below both lines, and every AI page sits far outside both.

Eighteen documents make a demonstration rather than a validated classifier. The first version of this study used only essays and documentation as baselines, which left the objection that the metrics were separating registers rather than authors, so the corpus now includes three pre-LLM READMEs that sell tools the way the AI pages do. The structural gaps survived that control untouched, while two metrics that looked strong against essays alone, first person and lexical diversity, collapsed into register signals. That is the argument for keeping the baselines adversarial. The next escalation would be a large sample of post-LLM, human-written landing pages, but the sizes of the surviving gaps, three-fold at the closest edge and roughly twenty-fold on average with no overlap, make me confident the structural metrics would hold.

The flow of a sentence

Everything above counts features one at a time. The last metric came out of a different question. Read the two corpora side by side and the sentences feel different in a way none of the counters capture, so I asked whether the rhythm of consecutive sentences could be measured too. I did not know what to count in advance, so we searched for it in the style of karpathy/autoresearch. A frozen harness feeds every document to a candidate scoring function as a bare sequence of per-sentence measurements and reports how cleanly the scores split the ten AI pages from the eight human baselines. The scoring function is the only file that changes between runs, and every attempt lands in a journal that now holds over fifty experiments, most of them failures.

The failures narrowed the answer. Statistics of pure order, which measure how sentence lengths rise and fall while ignoring how large they are, all fell short of separating the groups, so the tell is not in the alternation. What survived is almost embarrassingly simple. Split each sentence at its punctuation marks and keep the longest piece, the longest run of words the sentence lets through without a pause. Human writers keep producing sentences that contain one long run, whatever their register. The AI pages break nearly every sentence before a run can develop.
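The measurement itself fits in a few lines. This sketch assumes that commas, semicolons, colons, parentheses, dashes, and terminal marks all count as pauses; which marks analyze_flow.py actually splits on is not stated here.

```python
import re

# Assumed pause set: , ; : ( ) en dash, em dash, and sentence-final marks.
PAUSE = re.compile(r"[,;:()\u2013\u2014.!?]")


def longest_run(sentence: str) -> int:
    """Word count of the longest stretch between two pauses."""
    return max(len(piece.split()) for piece in PAUSE.split(sentence))
```

For the sentence "Human writers keep producing sentences that contain one long run, whatever their register." the longest run is the ten words before the comma. For a heading-style line like "Two jobs, one binary." it is two.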

The first version that separated the corpus scored each sentence on a ramp between a 10-word run and a 15-word run, and it worked, but both constants were picked by hand. Ranking removes them. Give each sentence the fraction of all runs in the corpus that are shorter than its own, and average those percentiles over the document:

$$\mathrm{flow} = \frac{1}{n}\sum_{i=1}^{n} F(r_i)$$

where $n$ is the number of sentences in the document, $r_i$ is the longest run in sentence $i$, and $F(r)$ is the fraction of runs, pooled over the whole corpus, that are shorter than $r$. Statisticians will recognize the Mann-Whitney rank statistic. The flow score is the probability that a random sentence-run from the document outlasts a random run from the corpus, and it carries no tuned constants at all. We also tried a sliding-window generalization that multiplies the percentiles of consecutive sentences, and it looked stronger until we normalized the scale, at which point the gain vanished. That correction is in the journal too. The order of the sentences adds nothing; the length of the runs carries the whole signal.
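A short sketch of the score, assuming each document has already been reduced to a list of longest-run lengths, one per sentence. Ties are treated as "not shorter", which is one reading of the definition; a mid-rank convention would give tied runs half credit instead.

```python
from bisect import bisect_left


def flow(doc_runs: list[int], sorted_pool: list[int]) -> float:
    """Mean fraction of pooled runs that are strictly shorter than each run."""
    if not doc_runs or not sorted_pool:
        return 0.0
    # bisect_left returns how many pooled values are strictly below r.
    below = sum(bisect_left(sorted_pool, r) for r in doc_runs)
    return below / (len(doc_runs) * len(sorted_pool))


def flow_scores(runs_by_doc: dict[str, list[int]]) -> dict[str, float]:
    """Score every document against the pool of all runs in the corpus."""
    pool = sorted(r for runs in runs_by_doc.values() for r in runs)
    return {name: flow(runs, pool) for name, runs in runs_by_doc.items()}
```

Because the pool is sorted once, each lookup is a binary search, and no constant in the code depends on how long a "long" run is.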

The raw material makes the tell visible before any formula does. The chart shows the longest run in each of the first sixty sentences of one human baseline and of crabbox.sh, one of the two AI pages nearest the human range. The human post keeps clearing ten words without a pause. The AI page almost never does.

Averaging the percentiles gives one number per document, and the groups separate completely. The AI pages score between 0.19 and 0.41, the human texts between 0.49 and 0.73, and refitting the threshold with any single document held out still classifies all eighteen. The margin is thinner than the detector’s, since the strongest AI page comes within 17 percent of the flattest human baseline, the ripgrep README. This post scores 0.56, in the middle of the human range.
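The held-out check can be sketched as follows, assuming the threshold is the midpoint between the highest AI score and the lowest human score in the training documents. That fitting rule and the function names are my assumptions; the study's script may fit the line another way.

```python
def midpoint_threshold(ai_scores: list[float], human_scores: list[float]) -> float:
    """Halfway between the top AI score and the bottom human score."""
    return (max(ai_scores) + min(human_scores)) / 2.0


def leave_one_out_hits(ai_scores: list[float], human_scores: list[float]) -> int:
    """Refit without each document in turn and count correct calls.

    Needs at least two documents per group so a refit is always possible.
    """
    labeled = [(s, "ai") for s in ai_scores] + [(s, "human") for s in human_scores]
    hits = 0
    for i, (score, label) in enumerate(labeled):
        rest = labeled[:i] + labeled[i + 1:]
        ai = [s for s, group in rest if group == "ai"]
        human = [s for s, group in rest if group == "human"]
        guess = "ai" if score < midpoint_threshold(ai, human) else "human"
        hits += guess == label
    return hits
```

With the corpus extremes reported above, 0.41 and 0.49, the full-corpus midpoint is 0.45. The held-out test is harder than it looks, because removing the boundary document on either side moves the line toward the other group.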

The reference implementation is analyze_flow.py in the study repository, and the search that produced it, including every dead end, is preserved in the autoresearch directory.

Long-form tweets in the wild

The corpora above have ground truth, which is what makes the thresholds checkable. The place people actually want a detector is the feed, where there is none, so as a last exercise we pointed the same counters at tweets. I keep a personal archive of tweets captured while browsing, about 27,000 at the time of writing, and from it we built one sample per account, made of every original long-form tweet (over 280 characters) in date order, for every account with at least 2,000 words of such text. That produced 42 accounts, my own included, and each sample is archived in the ai-smell repository with a source link per tweet.
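The sampling rule can be sketched like this. The `Tweet` record is a hypothetical shape for an archive entry, not the format of the actual archive; the 280-character and 2,000-word cutoffs are the ones stated above.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Tweet:  # hypothetical record shape
    account: str
    text: str
    created_at: datetime
    is_original: bool  # False for retweets and quotes of other accounts


def build_samples(tweets: list[Tweet], min_chars: int = 280,
                  min_words: int = 2000) -> dict[str, list[Tweet]]:
    """One date-ordered sample per account that has enough long-form text."""
    by_account: dict[str, list[Tweet]] = defaultdict(list)
    for t in tweets:
        if t.is_original and len(t.text) > min_chars:
            by_account[t.account].append(t)
    samples = {}
    for account, items in by_account.items():
        items.sort(key=lambda t: t.created_at)
        if sum(len(t.text.split()) for t in items) >= min_words:
            samples[account] = items
    return samples
```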

The chart puts the 42 samples over three metric pairs, with the ground-truth pages and baselines left in faintly in every panel so the wild samples can be read against the corpus that set the thresholds. The first panel repeats the detector chart exactly, same axes, same limits, same two thresholds. Every panel also carries one more point, a green diamond for this post itself, measured the same way as everything else.

Nothing here is a verdict, since none of these samples has a known author process. Under the unchanged rules of the first panel, seven of the 42 accounts trip the detector. Four cross the triad line, three cross the labeled-bullet line, and none cross both. On the landing pages the tells came bundled, every page far past both thresholds at once, while in the feed each flagged account trips exactly one rule. The bullet share also rests on smaller counts here, since a thread carries far fewer bullets than a landing page, and one of the three flagged accounts crosses the line on just two labeled bullets. So the thresholds transfer, but the confidence does not; a verdict in the feed would need feed-specific baselines.

The other two panels show where the feed does and does not resemble the corpus. On the dash axes, six accounts run past every human baseline, topped by one at 34.5 dashes per 1,000 words, denser than eight of the ten AI pages; but the dash already dropped out of the detector on the ground-truth corpus, and it stays out here. Fragment share spreads the accounts across the whole range the two corpora span, so it separates nothing in the feed either.

The flow metric from the previous section gives an independent read on the same samples. It puts 13 of the 42 accounts below its midpoint threshold, and all three accounts that cross the labeled-bullet line are among them. That agreement is worth pausing on, because the two measurements share nothing. One counts bullet shapes and the other reads clause lengths, yet they flag the same accounts. The two lowest scorers sit below every AI landing page in the corpus, and the usual register caveat applies with extra force here, since the threshold was calibrated on landing pages against long-form prose and a punchy feed style will read as low flow on its own.

The account whose long posts sent me down this path is one of the three past the bullet line. Its sample writes 41 labeled bullets out of 90, a 46% share, half again past a threshold that no pre-LLM baseline came near, while its dash and triad rates sit mid-field. The counters flag it rather than clear it, with the caveat above about register. Its long posts also share a hook template the counters never measure. They open with lines like “90% of AI developers just…”, pivot on “It’s not X. It’s Y.”, and close with “Let me explain.” That is the next metric worth building, and the tweet samples are archived so anyone can beat me to it.
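As a starting point for that metric, the pivot alone is easy to catch. This is a hypothetical sketch, not something the study measured: it looks for a sentence of the form "it/this/that is not ..." followed directly by a sentence that opens with the matching affirmative.

```python
import re

NOT_X_ITS_Y = re.compile(
    r"\b(?:it|this|that)(?:['\u2019]s| is) not\b"   # "It's not X"
    r"[^.!?]*[.!?]\s+"                              # rest of the sentence
    r"(?:it|this|that)(?:['\u2019]s| is)\b",        # "It's Y"
    re.IGNORECASE,
)


def count_contrast_pivots(text: str) -> int:
    """Number of 'It's not X. It's Y.' pivots in the text."""
    return sum(1 for _ in NOT_X_ITS_Y.finditer(text))
```

A plain negation such as "It is not ready yet." does not count, because no affirmative sentence follows it.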

The de-smeller

The practical output of all this is kill-ai-smell, a skill in my tools repo that any coding agent can load. It covers the tells from punctuation up through page structure, and every rule carries a bad example next to a rewrite, because the rewrite teaches the hard half of the lesson. Fixing a smell means restructuring the sentence. Swapping an em dash for a punchy colon changes nothing.

The most instructive moment of the project happened while writing it up. My agent, fresh off measuring contrast rhetoric as a top tell, produced a report whose highlighted callout was titled “The strongest tells are structural, not lexical”. The detector fires on its own author. That lesson is now the second paragraph of the skill. Knowing the rules is no defense, because these patterns are how models write by default, so the sweep has to be mechanical, applied to your own output, and repeated on the text that describes the sweep.

Feel free to steal the skill, and if you run the metrics on a corpus of your own, I would love to see the numbers.

A plea

I realize I am doing slop vendors a favor by putting this out there. The tells are enumerated and scripted now, and anyone who wants their generated copy to pass can point a model at this post and patch the fingerprints. I accept that trade. What I can’t bear is actually legit people using AI to write their long-forms and publishing the model’s house style untouched, and I’d rather have everyone, slop vendors included, stop using patterns like “It’s not X. It’s Y.” and the triads than have to read them ever again.

I also know there is a brain behind those AI-generated posts. Somebody lived the experience worth posting about, formed the opinion, and then let a model flatten it into the same shapes as every other post in the feed. This favor is for them. The de-smeller is there so they can keep using the machine without shipping its defaults.

Appendix: the corpus

Every text is archived as measured in the ai-smell repository, which is the maintained home of the study. It holds the corpus, the analysis scripts, the raw results, and the figures. The AI pages were captured from the live sites in July 2026, with code blocks still in place (the script strips them before counting).

The corpus splits into four groups there:

corpus/ai holds the ten OpenClaw landing pages.
corpus/human holds the eight pre-LLM baselines: the SQLite testing docs, essays by Joel Spolsky (2000), antirez (2018), Paul Graham (2009), and Julia Evans (2019), and the ripgrep, Redis, and Requests READMEs at their 2016–2017 git tags.
corpus/tweets holds the 42 long-form tweet samples, one file per account, date-sorted, with a link back to each tweet.
corpus/self holds this post itself, archived as measured, provably AI-written by its own disclaimer. Written under the kill-ai-smell skill, it clears the detector from the other side, with zero em dashes, exactly-three lists at a sixth of the threshold, and no labeled bullets. The tells are a default, not a fingerprint, and a model instructed against them stops producing them.

@onusoz · /2026/07/12 · 06:47 PM

gotta love competition

Image hidden

@onusoz · /2026/07/12 · 01:54 PM

Anyone else notice that any non-codex harness is more token-efficient than codex on the same task?

@evalstate · Jul 12, 2026

I'm running Terminal Bench 2.1 with fast-agent (this is a subset while I try and get things working). Take this as raw, unchecked evalstate is normally wrong data but if we're talking about burning Codex quotas this might explain why I don't feel it so much?

Image hidden

@onusoz · /2026/07/12 · 01:11 PM

I SEE BURNS... BURNS EVERYWHERE

@paulg · Jul 11, 2026

This is like someone who can bench press 30 pounds writing about what it takes to bench press 600.

@onusoz · /2026/07/12 · 11:44 AM

Here is the conversation where I built it. You can just prompt things @ratatui_rs highly recommended for TUIs. It just looks nice github.com/osolmaz/ann…

@onusoz · /2026/07/12 · 11:26 AM

I vibed a TUI in just 13 codex messages, and it works great

@evalstate · Jul 12, 2026

AnnoTUI by @onusoz is next level🔥 Annotate assistant responses, return quotes and comments for your follow-up (video should explain). IMO one of the biggest productivity unlocks rn. Used here with fast-agent (I have it bound to CTRL+X a), easy to integrate anywhere.

@onusoz · /2026/07/12 · 05:35 AM

No better feeling than waking up to the smell of 5 successfully finished agents

@onusoz · /2026/07/12 · 05:32 AM

protip: use fable to write your READMEs and other user facing docs thank god fable exists, now that gpt 4.5 is gone

@onusoz · /2026/07/11 · 12:54 PM

.@WisprFlow is much more convenient on android than iOS because the accessibility settings are more permissive It lets me overlay a button to record, anywhere on the screen, and lets me keep the regular keyboard at the same time

@onusoz · /2026/07/11 · 10:40 AM

most well-deserved ad in the world @Alibaba_Qwen

Image hidden

@onusoz · /2026/07/10 · 08:04 PM

this is cool agentic behavior under /goal. in a separate session, I had created an implementation plan in the main branch. The agent picked it up without me explicitly telling it to. The goal was to "$autoimplement finish completely" Broad orders running in loops are useful

Image hidden

@onusoz · /2026/07/07 · 11:41 AM

huge endorsement by geohot on GLM 5.2

@__tinygrad__ · Jul 6, 2026

I cannot believe how good GLM 5.2 is. Several weeks in now and it's mostly all I have been using. It doesn't have the alignment issue of cloud AI, it's much more clear what it can do and can't because it doesn't care about you taking another $$$ spin at the token slot machine.

@onusoz · /2026/07/06 · 10:29 PM

This is significant, not sure how it will play out. There is a chance it might be for the better

@mitsuhiko · Jul 4, 2026

I had some vibes that Opus 4.8 was performing worse than older ones for some of uses that are off distribution and now I have the receipts. Latest Opus/Sonnet are causing tool invocation failures on Pi's edit tool when older ones did not! I wrote about it. lucumr.pocoo.org/2026/7/4/bette…

@onusoz · /2026/07/06 · 08:35 PM