Another annoying GPT-ism (circa June 2026): while describing something, it always describes what it *does*, but never what it *is*
> "LocalPerf benchmarks local LLM inference servers and keeps the evidence in one portable run artifact."
If I were a philosopher, I would say that "AI lacks ontology". Or that it is "anti-essentialist", believes that things cannot be things in themselves
But all models literally have an ontology. They have had one since the word2vec days, and you can plot it out. It's just an annoying tendency in GPT's writing
So using philosophy to understand AI might be dumb sometimes
If you don't want your READMEs to sound like slop, then you can steal my write-readme skill:
After write-readme: "LocalPerf is a local LLM inference benchmark CLI. It runs benchmark plans against local inference servers and stores the evidence in one portable run artifact."
Skill: github.com/osolmaz/tools/blob/main/agents/skill…
Give @LottoLabs a follow if you are not already
He is building localmaxxing.com, crowdsourced LLM benchmark results and performance profilings
Very very cool
This. Especially when the whole machine hangs instead of OOMing when my agent accidentally loads too many models into memory
(lmk if there is a firmware update or sth that fixes this on the spark now, creating a cgroup doesn’t work)
forums.developer.nvidia.com/t/dgx-spark-be…
The Local Frontier is advancing
The amount of AI memory for inference we can get for less than $3000 has been steadily increasing
The memory crunch has slowed this down and even made it retrograde. However, once we bounce back from it, the progress will be glorious
Besides being hot, this take is very correct
and points to a fundamental tradeoff in the storage layer: centralized versus distributed
git is distributed, and that makes total sense for code, which takes up little space by its nature. it is cheap for everyone to duplicate it locally. this proved to be very useful compared to e.g. svn, since devs could develop independently from the centralized server
AI artifacts, however, are 1 million times bigger than code. in that case, the bottleneck becomes storage and network. decentralization and version control become lower priority. they can be sacrificed
the tradeoff tilts towards getting the cheapest possible storage and transfer. because you will need a LOT of that
I regret to announce to competitors that Hugging Face has already won this game when they acquired Xet. the tech just works, and the network effects are immense
My posts here on X now sync automatically to my blog, giving me full ownership of my content and zero effort SEO
For free, no API costs
My long-form posts are automatically featured and titled on the front page of solmaz [dot] io. Filtering is done by my claw running a sync-x skill daily, which then notifies me on Discord
How do I scrape the posts?
ALL the posts I view (including the ones I post) are saved locally using @kubmi's xTap and then synced to a private repo using my extension to it, xtap-sync
My claw has access to that private repo and can run programmatic tasks like sync-x, summarize what happened that day, notify me about any topics I want
If you are interested in running such demos, look into --demo mode in my local model swiss army knife localpi
github.com/dutifuldev/loc…
Thank you @googlegemma for the shoutout
New blog post: Using local models for agentic zero-shot classification, in real-time, high frequency triage
If you have 128 GB of memory for models (a DGX Spark like I do, for example), you can create a real-time classifier and notifier for yourself that can classify more than 20 items per minute, using mid-sized @googlegemma and @Alibaba_Qwen models, with 200-300 output tok/s aggregate throughput
Like processing new tweets on twitter, issues/prs on github, messages on telegram and discord, in real-time
Over the past few weeks, I have built one for myself, to filter and get notified about local model related issues on the OpenClaw repo
I initially thought gemma-4-e4b would give me the best tradeoff
I was wrong. I learned that if one has enough memory already, one should not bother with <10b models like gemma4 e4b or e2b. Precision and recall were much higher zero-shot with gemma-4-26b-a4b, whereas the smaller e4b needed significant prompt optimization and still did not perform nearly as well
To provide more context to the model, I created a restricted bash-like shell, called reposhell. In that shell, it can run read-only commands to ls/find/grep/cat openclaw source code, but only that. When the PR description/diffs are not clear enough to categorize it, the agent reads the code to figure it out
The shell is restricted because small models can get prompt injected, and I need to make sure that someone can't harm my setup by creating a malicious issue or PR in the openclaw repo
I found that for specific systems like this, it is very convenient to extend and bundle Pi. You can create agentic CLI tools that work fully locally and for free, and keep that separate from your main pi coding setup. localpager-agent has its own session dir and tools, and I ensure that it will run local models in a secure way by isolating it from my main pi setup
Once localpager-agent categorizes a PR/issue as local_models and related labels, I automatically receive it as a notification on Discord
The whole implementation is fully open source and MIT licensed, alongside the dataset we used to benchmark the performance
I believe zero-shot agentic classification running on local hardware will find many use cases across a wide variety of business applications, like news gathering, open source software development, customer support, content moderation, sales and so on
Agents increase the amount of information produced in a lot of systems, and hence we will need to set up cheap ways to wrangle all that information
In times where governments can cut off access to SOTA models on a whim, it is more important than ever to build your business on open models and if possible, run them on your own hardware!
Big thanks to @evalstate and @ben_burtenshaw for their valuable feedback, especially with helping me evaluate this more rigorously! One take-away is that categorizing contributions in an open source repo is a *hard* problem, and that it is not trivial to reliably create a golden dataset with LLMs, for evaluation purposes
Read more here: huggingface.co/blog/local-models-pr-triage
One sweep over 100 samples takes around 4 hours.
Next up: cross reference ground truth with predictions from hf-mem by @alvarobartt github.com/alvarobartt/hf…
gpt5.5 and most other models are very bad at one-shotting nice data models
gpt5.5 also has this annoying property that once it decides on a schema (or any design), it's very hard to trigger thinking again. and if you ask to "rewrite from scratch", it will create something even more ridiculous
To solve this problem, I have built a meta-harness over codex just for simplifying slop data models called schemator (work in progress)
Basic idea: it mimics what I myself do while I am designing a schema: scrutinize and question each field one by one
It starts a fresh codex session for each field with a fixed prompt like "Try to come up with the most Lindy data model" + a prompt for side notes
It does that with a fresh context for each field, so that they are independent from each other. At the end of a review run over a field, the reviewer can propose to keep, rename or remove the field
When all fields are reviewed once, that makes one iteration. Then this is looped over until the review results stabilize, and do not propose any further changes
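A minimal Python sketch of that review loop, not schemator's actual code. `review_field` is a hypothetical stand-in for "start a fresh codex session for this field", and the decision values (`"keep"`, `"remove"`, `("rename", new_name)`) and the `max_iter` cap are assumptions made for illustration.

```python
def simplify_schema(schema, review_field, max_iter=10):
    """Review every field once per iteration until a full pass changes nothing."""
    current = dict(schema)
    for _ in range(max_iter):
        changed = False
        for name in list(current):
            # Hypothetical: each call stands for one independent, fresh-context review.
            decision = review_field(name, current)
            if decision == "remove":
                del current[name]
                changed = True
            elif isinstance(decision, tuple) and decision[0] == "rename":
                new_name = decision[1]
                if new_name != name:
                    current = {new_name if k == name else k: v for k, v in current.items()}
                    changed = True
        if not changed:
            break  # review results have stabilized
    return current


def toy_review(name, schema):
    """Toy reviewer used only to exercise the loop."""
    if name == "tmp_flag":
        return "remove"
    if name == "usr_nm":
        return ("rename", "username")
    return "keep"


result = simplify_schema({"id": "int", "usr_nm": "str", "tmp_flag": "bool"}, toy_review)
```

The first pass renames one field and removes another, and the second pass proposes nothing, so the loop stops there.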
I get better results by just asking my agent to "use schemator on this" after it creates a JSON schema or SQL table
Give it a try if you have codex! It has a skill, so should be easy for an agent to figure out how to use
github.com/dutifuldev/schemator
gpt 5.5 is not naturally good at modeling and cannot create simplified nice mathematical models completely autonomously
I did a parameter sweep with gemma-4-31b-a4b on memory usage, output tok/s etc. while varying context window, concurrency and other parameters. It took quite a few tries, and I still do not trust the model that gpt5 fit to the data
besides, it measured linux cgroup memory and not the actual gpu memory used, so the whole sweep is wasted...
output tok/s looks more accurate though, soon I will have a model that can give the optimal parameters over the space of context window <> concurrency <> tok/s <> memory usage
off to do another run
For my recent LLM leaderboard osolmaz-leaderboard.hf.space, I sum up all time total downloads (or likes) across model variants, and then divide it by the age of that model. I.e. "time decay" for popularity
This gives a more time-agnostic metric for the popularity of that model. In an ideal ranking, older models that are not popular anymore should be demoted, like 2 year old Llama 3 models. If you don't do that, they might still occupy top 10 needlessly, despite having been replaced by e.g. qwen in practice
Thanks to that, qwen-3-6b which came out 1 year ago and has 150m downloads can surpass llama-3-1-8b which came out 2 years ago and has 200m downloads
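The arithmetic behind that comparison, as a small Python sketch. It uses the rounded numbers from the post, and the function name is made up for illustration.

```python
def downloads_per_year(total_downloads, age_years):
    """Time-averaged popularity: lifetime downloads divided by model age."""
    return total_downloads / age_years


qwen = downloads_per_year(150_000_000, 1)   # 150M downloads per year
llama = downloads_per_year(200_000_000, 2)  # 100M downloads per year
```

The younger model ranks higher even though it has fewer total downloads.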
More notes on my post: solmaz.io/popularity-ranking
if you take the Most Downloaded Models of All Time, Llama 3.1 makes it to Top 10 with around 200 million total downloads (ranking is done w.r. to time-averaged downloads)
RIP Llama, you walked so @googlegemma and @Alibaba_Qwen could run
Also a reminder that if you build your branding on top of open weight models developed by big corps, you might eventually be the de facto owner of that brand if they pull the plug on it. Like llama.cpp @ggml_org
Huge fumble by Meta
My LLM leaderboard osolmaz-leaderboard.hf.space auto discovers different variants of model releases, even if they are not linked by base_model
From this, I found out that @RedHat_AI was the first to release NVFP4 quantization for qwen3-6-35b-a3b
Nice to see everything in one place
I recently did some work ranking models on Hugging Face. While doing that, I remembered some concepts I had known years ago from studying recommender systems. But I couldn’t find any personal notes from that time.
So I’m leaving this cheatsheet for my future self, if I ever need it again.
The main idea with popularity metrics is that popularity is proportional to e.g. total likes/views, and inversely proportional to the time it took to accumulate those likes/views. Different platforms have come up with many different ways to calculate this. And while you can model this in a way that maximizes some imaginary objective, what ends up being implemented first is the cheapest/most efficient algorithm.
Below are some examples, generated by GPT 5.5 xhigh.
<slop>
The abstract problem is:
\[\text{rank items by scarce attention}\]
A platform has many items and a limited front page. It needs to decide what deserves visibility now. That is usually not the same as “best,” “most useful,” or “most popular all time.”
A clean taxonomy:
\[\text{popular} = \text{received a lot of attention}\]
\[\text{hot} = \text{received a lot of attention recently}\]
\[\text{trending} = \text{receiving more attention than expected}\]
Here $A$ means attention: views, downloads, likes, votes, streams, sales, stars, comments, clicks, etc.
Raw popularity
This is the simplest ranking.
\[S = A\]
Use it when you want “biggest ever.”
Examples: most downloaded, most viewed, most sold, most starred.
Problem: old items dominate because they had more time.
Velocity
This measures speed of attention.
\[S = \frac{A}{t}\]
where $t$ is age.
Use it when you want “how fast is this spreading?”
A stricter version:
\[S = \frac{A}{(1+t)^\alpha}\]
If $\alpha = 1$, this is close to attention per unit time.
If $\alpha < 1$, old items are penalized more gently.
If $\alpha > 1$, new items are favored aggressively.
This family is close to what Hacker News describes at a high level: HN says its basic ranking divides points by a power of time since submission, while also applying other factors such as flags, anti-abuse systems, demotions, account/site weighting, and moderator action.[1]
Log-scaled velocity
Raw attention often follows a power law: a few items get enormous numbers. So platforms often compress the signal.
\[S = \frac{\log(1+A)}{(1+t)^\alpha}\]
This keeps huge items ahead, but prevents them from crushing everything else.
This is usually a better “hotness” formula than plain:
\[S = \frac{A}{t}\]
because it rewards scale without making scale the only thing that matters.
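A small Python sketch comparing the plain and log-scaled forms, using the $(1+t)^\alpha$ denominator from above. The item sizes and ages are invented for illustration, and the age unit (days) is an assumption.

```python
import math


def velocity(attention, age, alpha=1.0):
    """S = A / (1 + t)^alpha"""
    return attention / (1 + age) ** alpha


def log_velocity(attention, age, alpha=1.0):
    """S = log(1 + A) / (1 + t)^alpha"""
    return math.log1p(attention) / (1 + age) ** alpha


old_giant = (1_000_000, 30)  # (attention, age in days), made-up numbers
fresh_item = (2_000, 1)

plain_scores = (velocity(*old_giant), velocity(*fresh_item))
log_scores = (log_velocity(*old_giant), log_velocity(*fresh_item))
```

With plain velocity the old giant still wins by a wide margin. With the log-scaled version its size advantage is compressed enough that the fresh item comes out ahead.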
Recent-window popularity
Instead of lifetime attention, count only a recent window.
\[S = A_r\]
where $A_r$ is recent attention.
Or normalize by window size:
\[S = \frac{A_r}{w}\]
where $w$ is the time window.
Examples:
\[\text{most viewed today}\]
\[\text{most streamed this week}\]
\[\text{most downloaded in the last 30 days}\]
Spotify’s daily and weekly charts are this kind of family, though Spotify also says it uses chart-eligible streams and filtering formulas to protect chart integrity; it does not simply expose raw app stream counts as chart counts.[2]
Momentum
Momentum compares the current period with the previous period.
\[S = \frac{A_r + 1}{A_p + 1}\]
where $A_p$ is previous-period attention.
Example:
\[S = \frac{\text{downloads this week}+1}{\text{downloads last week}+1}\]
This finds things that are accelerating.
Problem: small items can look extreme. Going from 1 to 20 is a $20\times$ jump, but it may still be tiny in absolute terms.
A safer version mixes ratio and volume:
\[S = \log(1+A_r)\frac{A_r+1}{A_p+1}\]
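A quick Python sketch of both momentum variants. The two example items are invented to show the small-item problem.

```python
import math


def momentum(recent, previous):
    """S = (A_r + 1) / (A_p + 1)"""
    return (recent + 1) / (previous + 1)


def volume_weighted_momentum(recent, previous):
    """S = log(1 + A_r) * (A_r + 1) / (A_p + 1)"""
    return math.log1p(recent) * momentum(recent, previous)


tiny = (20, 1)            # (this period, previous period), made-up numbers
large = (50_000, 10_000)

ratio_scores = (momentum(*tiny), momentum(*large))
weighted_scores = (volume_weighted_momentum(*tiny), volume_weighted_momentum(*large))
```

On the pure ratio the tiny item wins. Once volume is mixed in, the large item that grew about 5x ranks higher.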
Trend detection
Trending is not just “popular.” It usually means “unusually active relative to expectation.”
\[S = \frac{A_r + 1}{E + 1}\]
where $E$ is expected attention.
If something normally gets 100 views/day and now gets 10,000, it is trending.
If something normally gets 10 million views/day and now gets 10.5 million, it is popular but not necessarily trending.
Another version:
\[S = A_r - E\]
The ratio version favors surprise.
The difference version favors large absolute surges.
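The two examples above, run through both versions in a short Python sketch. The function names are made up for illustration.

```python
def surprise_ratio(recent, expected):
    """S = (A_r + 1) / (E + 1)"""
    return (recent + 1) / (expected + 1)


def surge(recent, expected):
    """S = A_r - E"""
    return recent - expected


niche = (10_000, 100)              # normally 100 views/day, now 10,000
giant = (10_500_000, 10_000_000)   # normally 10M views/day, now 10.5M

ratio_scores = (surprise_ratio(*niche), surprise_ratio(*giant))
surge_scores = (surge(*niche), surge(*giant))
```

The ratio ranks the niche item first, and the difference ranks the giant first. Which one counts as "trending" depends on which of the two you pick.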
Google Trends is a useful example of normalization: it divides search interest by total searches for the relevant geography and time range, then scales results from 0 to 100, so large regions do not automatically dominate raw volume rankings.[3]
Hotness
Hotness combines attention and freshness.
A simple hotness score:
\[S = \log(1+A) - \lambda t\]
Popularity pushes up. Age pulls down.
Another common form:
\[S = \frac{\log(1+A)}{(1+t)^\alpha}\]
This says: “large attention matters, but old attention decays.”
Classic Reddit-style hotness
The old open-source Reddit code had a “hot” formula based on vote balance, logarithmic scaling, and time. In simplified notation:
\[S = \operatorname{sign}(u-d)\,\log_{10}\big(\max(|u-d|,\,1)\big) + \frac{T}{45000}\]
where $u$ is upvotes, $d$ is downvotes, and $T$ is time in seconds since a reference epoch. This is specifically the archived open-source Reddit implementation, not a guarantee of current Reddit production ranking.[4]
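A Python sketch of the archived hot sort as I recall it. The constants (the 45000-second divisor and the epoch offset) are reproduced from memory of the archived code, so treat them as assumptions, not as current Reddit behavior.

```python
import math

# Reference epoch offset in seconds, as recalled from the archived code (assumption).
REDDIT_EPOCH_OFFSET = 1134028003


def hot(ups, downs, created_unix):
    """Log-scaled vote balance plus a linear bonus for newer submissions."""
    s = ups - downs
    order = math.log10(max(abs(s), 1))
    sign = 1 if s > 0 else -1 if s < 0 else 0
    seconds = created_unix - REDDIT_EPOCH_OFFSET
    return round(sign * order + seconds / 45000, 7)


t0 = 1_700_000_000  # arbitrary submission time, for illustration only
older_with_100 = hot(100, 0, t0)
newer_with_10 = hot(10, 0, t0 + 45000)
```

A post submitted 45000 seconds (12.5 hours) later needs 10x fewer net upvotes to reach the same score, which is why time affects ordering so strongly.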
The important idea: votes matter logarithmically, and time strongly affects ordering. This makes the ranking feel alive.
Time-decayed attention
Instead of using age directly, you can make every attention event fade over time.
\[S = \sum A_i e^{-\lambda \Delta t_i}\]
where each attention event $A_i$ contributes less as it gets older.
Plain language: a view today counts more than a view last month.
This is good when you have event-level data.
A simpler approximate version:
\[S = A_r + \beta A_p\]
where $0 < \beta < 1$.
Example:
\[S = \text{attention this week} + 0.5 \times \text{attention last week}\]
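A minimal Python sketch of the event-level version. Expressing $\lambda$ through a half-life is my own parametrization and does not come from any particular platform.

```python
import math


def decayed_score(events, now, half_life):
    """S = sum(A_i * exp(-lambda * dt_i)), with lambda = ln(2) / half_life.

    events: iterable of (timestamp, weight) pairs, in the same time unit as half_life.
    """
    lam = math.log(2) / half_life
    return sum(weight * math.exp(-lam * (now - ts)) for ts, weight in events)


# One view today and one view a week ago, with a one-week half-life.
score = decayed_score([(7, 1.0), (0, 1.0)], now=7, half_life=7)
```

With a one-week half-life, an event from last week counts half as much as one from today, which is the same weighting as the $\beta = 0.5$ example above.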
Steam’s real-time Top Sellers use this general idea in a revenue context: Steam says it rolls up player spending from the trailing 24 hours and gives extra weight to spending in the last 3 hours, across base game purchases, DLC, and in-game transactions.[5]
Quality-adjusted popularity
Sometimes attention alone rewards clickbait. So platforms mix attention with satisfaction.
YouTube Charts disclose this kind of multi-signal logic: they consider view count, how quickly views are growing, where views come from, topic, age, and performance compared with recent uploads from the same channel; YouTube explicitly says the highest-view-count video is not necessarily ranked first.[6]
Confidence-adjusted ranking
This prevents tiny samples from winning too easily.
Bad ranking:
\[S = q\]
This lets an item with 2 perfect ratings beat an item with 10,000 very good ratings.
A Bayesian shrinkage version:
\[S = \frac{n}{n+k}q + \frac{k}{n+k}\bar{q}\]
where $n$ is sample size, $q$ is the item’s observed quality, $\bar{q}$ is the global average, and $k$ controls how much evidence you need before trusting the item.
Plain language: with little data, pull the score toward the average.
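The shrinkage formula as a small Python sketch. The ratings, the global mean and $k = 50$ are invented numbers for illustration.

```python
def shrunk_score(n, q, global_mean, k):
    """S = n/(n+k) * q + k/(n+k) * global_mean"""
    return (n / (n + k)) * q + (k / (n + k)) * global_mean


GLOBAL_MEAN = 3.5  # assumed average rating across all items
K = 50             # assumed amount of evidence needed before trusting an item

two_perfect_ratings = shrunk_score(2, 5.0, GLOBAL_MEAN, K)
many_good_ratings = shrunk_score(10_000, 4.6, GLOBAL_MEAN, K)
```

The item with 2 perfect ratings is pulled most of the way back to the global average, so the item with 10,000 very good ratings now ranks above it.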
IMDb is an example of this family in spirit: IMDb says it publishes weighted vote averages rather than raw averages, that not all votes have the same impact, and that it does not disclose the exact method.[7]
Wilson score ranking
For up/down votes, a common confidence-based formula is the Wilson lower bound.
Let:
\[p = \frac{u}{n}\]
where $u$ is positive votes and $n$ is total votes.
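The Wilson lower bound is then:
\[S = \frac{p + \frac{z^2}{2n} - z\sqrt{\frac{p(1-p)}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}\]
where $z$ is the normal quantile for the chosen confidence level (for example $z \approx 1.96$ for 95%).
This estimates a conservative lower bound for the true positive rate.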
Use it when you want “best-rated with enough evidence,” not merely “highest average rating.”
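A Python sketch of the Wilson lower bound for up/down votes. The default $z = 1.96$ (about 95% confidence) and the zero-vote fallback are my choices for illustration.

```python
import math


def wilson_lower_bound(positive, total, z=1.96):
    """Lower bound of the Wilson score interval for a positive-vote fraction."""
    if total == 0:
        return 0.0  # no evidence yet; fallback chosen for this sketch
    p = positive / total
    denominator = 1 + z * z / total
    centre = p + z * z / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - margin) / denominator


two_of_two = wilson_lower_bound(2, 2)
most_of_many = wilson_lower_bound(950, 1000)
```

An item with 2 out of 2 positive votes gets a much lower bound than one with 950 out of 1000, even though its raw average is higher.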
Evan Miller’s “How Not To Sort By Average Rating” popularized this for web rankings, and the archived Reddit code includes a confidence sort using the Wilson method; Stack Overflow also discussed the same family of sorting methods for comments/answers.[8]
Category-normalized popularity
Raw popularity is unfair across categories.
\[S = \frac{A}{\bar{A}}\]
where $\bar{A}$ is average attention in that category.
Example: a niche item with 10,000 downloads may be huge in its category, while a general consumer app with 10,000 downloads may be irrelevant.
A velocity version:
\[S = \frac{A/t}{\bar{A}/\bar{t}}\]
Use this for “popular relative to peers.”
Spotify’s Local Pulse is a real-world example of relative popularity: Spotify says Local Pulse shows songs uniquely popular in a city relative to their overall popularity.[2]
Composite ranking
Most mature platforms do not use one pure formula. They combine signals.
Product Hunt is explicit that its homepage leaderboard changes based on upvotes, comments, time since submission, and other factors, while withholding exact details to reduce gaming.[9]
Amazon’s book sales ranking is also a composite/decayed-relative system: Amazon says rankings reflect recent and historical activity, recent activity is weighted more heavily, ranks are relative to other books, and rank can change even if the item’s own activity stays constant.[10]
If you are in AI, just don’t be anon here
I see a bunch of anon accounts posting great local model content… what’s the point of being anon? To seem cool?
Most of those accounts are not doing anything illegal, so there is no point. It would add so much more legitimacy to your work if you just put your real face and not a slop or anime girl pfp
it only makes sense for those who are abliterating models. otherwise, it makes you seem sus
just put your real face anon
I created an LLM leaderboard based on Hugging Face download and like counts, grouped, filtered and time-averaged. Top 5 downloads is shared by @Alibaba_Qwen and @googlegemma 👑🤝👑
Top 5 likes, on the other hand also includes @deepseek_ai V4 Pro 👑
Even @OpenAI makes it to #8 top downloads with gpt-oss-20b 👑
qwen3-6-35b-a3b is the second most CIRCULATED LLM of this year, with an average of 21 million downloads per month, since the day it was released 2 months ago 📈📈📈
Despite first place belonging to 8mo old qwen3-vl-2b-instruct, the highlight belongs to the mid-sized MoE model, which has hit a size/performance sweet spot so hard that it absolutely 💥 SHATTERED 💥 Hugging Face leaderboards in the 2 months since it launched
qwen3-6-35b-a3b is followed closely by its dense sibling 27b --- and then the mid-sized gemma 4 models 26b-a4b and 31b
Note that a model's distribution is inversely proportional to its size, but not strictly! Usefulness plays a factor as well, since gemma 4 26b-a4b is being downloaded more than the smaller gemma 4 e4b
I created this leaderboard because Hugging Face's all time highest downloads and likes did not give me enough information about what is really popular, neither today, nor all-time. I wanted something in between
How do I calculate this ranking?
- Get models with n_downloads >= 100k
- Exclude models older than 1 year
- Deduplicate and group quantizations and variants of the same model based on slug prefix heuristics
- For each group, sum up total downloads of all time
- Sort by descending total_downloads / age = average_downloads_per_day (can also sort w.r. to likes per month)
- Repeat every day to get the most up to date ranking
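The steps above can be sketched in Python as follows. This is not the leaderboard's actual source: the record fields (`id`, `downloads`, `created`), the suffix list in the grouping heuristic, and using the earliest release date as the group's age are all assumptions, and the model names are made up.

```python
import re
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical slug heuristic: strip a trailing quantization/format suffix.
QUANT_SUFFIX = re.compile(r"[-_.](gguf|awq|gptq|nvfp4|fp8)$")


def group_key(model_id):
    name = model_id.split("/")[-1].lower()
    return QUANT_SUFFIX.sub("", name)


def rank(models, today, min_downloads=100_000, max_age_days=365):
    groups = defaultdict(lambda: {"downloads": 0, "created": today})
    for m in models:
        age_days = (today - m["created"]).days
        if m["downloads"] < min_downloads or age_days > max_age_days:
            continue  # too small or too old
        g = groups[group_key(m["id"])]
        g["downloads"] += m["downloads"]                  # sum across variants
        g["created"] = min(g["created"], m["created"])    # assumed: age of first release
    rows = []
    for key, g in groups.items():
        age_days = max((today - g["created"]).days, 1)
        rows.append((key, g["downloads"] / age_days))     # average downloads per day
    return sorted(rows, key=lambda row: row[1], reverse=True)


today = date(2026, 6, 1)
models = [
    {"id": "org-a/alpha-7b", "downloads": 3_000_000, "created": today - timedelta(days=90)},
    {"id": "org-b/alpha-7b-GGUF", "downloads": 1_500_000, "created": today - timedelta(days=80)},
    {"id": "org-c/beta-3b", "downloads": 6_000_000, "created": today - timedelta(days=300)},
    {"id": "org-d/old-8b", "downloads": 200_000_000, "created": today - timedelta(days=730)},
    {"id": "org-e/tiny", "downloads": 50_000, "created": today - timedelta(days=30)},
]
ranking = rank(models, today)
```

Here the two alpha-7b variants are merged and land first with 50k downloads per day, beta-3b follows with 20k, and the 2-year-old model and the sub-100k model are dropped.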
More info and source on the leaderboard page, hosted on a Hugging Face space: osolmaz-leaderboard.hf.space
This is a work in progress, please reply below if you see that a model that should be there is missing, or any other mistakes
gemma-4-26b-a4b is the most CIRCULATED LLM of recent history, with an average of 126k downloads per day, since the day it was released 3 months ago
Top 10 is shared by Qwen and Gemma, with DeepSeek V4 Pro coming in close 🤝
Note that a model's distribution is inversely proportional to its size, but not strictly! Usefulness plays a factor as well, since gemma 4 26b-a4b is being downloaded more than the smaller gemma 4 e4b
I created this leaderboard because Hugging Face's all time highest downloads and likes did not give me enough information about what is really popular *these last few months*
How do I calculate this ranking?
- Get models with n_downloads >= 100k
- Exclude models older than 1 year
- Sort by descending total_downloads / age = average_downloads_per_day (can also sort w.r. to likes per month)
- Deduplicate quantizations etc. of the same model based on slug prefix heuristics
More info and source on the leaderboard page, hosted on a Hugging Face space: osolmaz-leaderboard.hf.space
I need better UI/UX on queueing messages to agents. I want to be able to:
switch the order of queued messages
pause the queue
edit any messages that are still in the queue
undo steer messages in the few seconds they are being sent
I want more visual emphasis on the queue, like a Queue View I can toggle, that puts the queue at the center
I want this in all the UIs and coding agents, codex CLI, desktop, moshi... especially while on the phone
16x parallel Gemma-4-26B-A4B-NVFP4 runs 🤯🤯🤯
18 output tokens/s, aggregate 300 tok/s
1 DGX Spark with 128 GB unified memory
Concurrency so high I had to demo it programmatically
It can go up to 32 even! 🤯 But then my screen would not have been readable for you
And this is not even using flashinfer yet! Please reply if you know whether support is on the way
Note that this is not dumb e4b or e2b that you can run on the average laptop. This is the big Gemma MoE
Model link: huggingface.co/nvidia/Gemma-4-26B-A4B-NVFP4
I did some math, and running my Nvidia GB10 workstation (Asus GX10) costs me maximum:
12~13 USD / month or 150~160 USD / year
It is a little bit above half the price of a ChatGPT Plus subscription. For that, I get to run models that can fit in 128 GB of memory
How I calculated:
You can see how much power your apartment uses in Singapore in half-hourly resolution. We turned off all devices and the A/C while we slept, leaving only the fridge and the GB10 running
From that, we see it used around 80-100 W while I was running an inference workload overnight. So this is roughly an upper bound
I take it as 90 Watt. Electricity here costs 0.25 SGD / kWh
0.09 kW * 0.25 SGD/kWh * 24 h * 30 days * (SGD to USD conversion rate) = 12~13 USD / month = 150~160 USD / year
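The same calculation as a short Python sketch. The exchange rate of 0.78 USD per SGD is an assumed value for illustration, so plug in the current rate.

```python
POWER_KW = 0.09            # 90 W, taken as the average draw
PRICE_SGD_PER_KWH = 0.25   # electricity price from the post
USD_PER_SGD = 0.78         # assumed exchange rate, not from the post

monthly_kwh = POWER_KW * 24 * 30
monthly_sgd = monthly_kwh * PRICE_SGD_PER_KWH
monthly_usd = monthly_sgd * USD_PER_SGD
yearly_usd = monthly_usd * 12
```

That is about 65 kWh and roughly 16 SGD per month before conversion.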
Local models are getting very good now, small ones roughly around GPT 5.x-mini level. This workstation makes all sorts of workloads possible for me that would otherwise cost a ton on the API
It is also my always on workstation that works overnight. I use Codex for my work, and my workstation is always running agents. It never sleeps. I never have to worry about keeping my laptop lid open. I connect and monitor the agents anytime on my phone using mosh and herdr
We have crossed a threshold. Running local models is cheaper than a big token sub for quite a few workloads already. If you are running a business, that makes a difference
The localening is here
Click open GitHub PRs and issues directly in the side pane in @herdrdev, instead of having to go to the browser. As many issues and PRs as you want, WITH TABS!
Install ghzinga herdr plugin and just ctrl+click the link: github.com/dutifuldev/ghzinga
Thanks @lumendriada for sneaking in the ability to capture link clicks 2 days after I requested it! God I love open source...
Trying to copy wrapped URLs is a pain not only in ghostty/iterm2 but also in mobile apps like Moshi
On the laptop it’s fine because I can select rectangular area and edit it, or make the window bigger
On the phone, it’s impossible. Fingers too big, too much of a hassle
Should a terminal emulator try to detect these? It already detects herdr. What do you think @odd_joel
nvidia/Qwen3.6-35B-A3B-NVFP4 running in vLLM nightly on my Nvidia GB10 is actually insane
50 tok/s, 4 concurrent generations. total 200 tok/s. ideal for spawning subagents or working in parallel
its tool calling behavior is very good as well. I will be giving it a test drive on an openclaw instance, and keep you posted
More details on NVIDIA forum: forums.developer.nvidia.com/t/benchmark-report-…
Current average generation speeds for local DeepSeek-V4-Flash-Q2, highest to lowest:
Mac Studio M3 Ultra: 32 tok/s
MacBook Pro M5 Max: 30 tok/s
Apple ??? M4 Max: 25 tok/s
MacBook Pro M3 Max: 24 tok/s
Mac Studio M2 Ultra: 22 tok/s
NVIDIA DGX Spark / GB10: 13 tok/s
It seems macs' higher memory bandwidth is contributing here, though I'm not sure if GB10 performance could be improved (I do hope so, I have one!)
Btw, TTS has come such a long way, @GoogleDeepMind cooked with gemini-3.1-flash-tts
I gave Codex my google credentials and it oneshotted the Gemini TTS implementation
When I built this 4 years ago, Azure TTS used to be SOTA. Then @ElevenLabs came in and raised the bar super high. Now Google is going after their lunch with controllable expressiveness at scale. I cheer for both!
Here is Manim Voiceover demo from 4 years ago with Gemini TTS (sound on)
A major concern I have these days: when I author code in languages I cannot code in manually, is the result any good?
Over years, I have worked with a number of languages: C, C++, Fortran, MATLAB, JavaScript
But Python was my go-to language for more than 10 years. Well, that changed last summer
So while I have strong opinions on how Python code should be, conventions and all, I don't have such strong opinions on other languages. That means I am producing slop by default in Rust, Go and TypeScript
To solve that problem, I created github.com/dutifuldev/slophammer
Its aim is to be "the only tool and resource your agent needs, to minimize slop"
It is inspired by the recent bathrobe rants of @unclebobmartin, a.k.a. the author of Clean Code
It enforces a minimum test coverage, maximum cyclomatic complexity, mutation tests, code style across different languages
But I have a major issue: How do I know that Slophammer itself isn't slop?
One way is to implement and use it for Python, the language I know better, and judge what kind of changes it enforces
So for this weekend experiment, I used Slophammer to refactor, improve coverage and merge new features to one of my old Python projects, Manim Voiceover github.com/ManimCommunity/manim-voiceover
The result is... mixed. We now have types everywhere, which is great. But the constraints have also made it write garbage code like this one. It works fine, even though it's not elegant. The new feature also works
What do you think? Does code still need to be aesthetically pleasing to the human eye? Should it still be human readable?
If an agent writes slop in the forest, and there is no-one to read it, is it still slop?
If anything, I should use its output in Python to reason about other languages, and add more and more constraints. The more the constraints, the less the slop
I got the names for all future models Anthropic will release
By asking ChatGPT “Cool sounding names that mean a work of literature”
Codex is one of them 💀
Dabbling in GEPA. Codex's /goal on GPT 5.5 high is still surprisingly reward-hacking
I had set a /goal before I slept to implement a plan. It ended the loop after doing just 1 iteration
It feels like the model is following the path of least resistance and slacking off. Though it could also be me putting "try to make good progress in 8 hour's time" in the prompt, can't be sure
Lesson: When you are doing such a solver loop, always specify min_iter and max_iter
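A generic Python sketch of such a bounded loop. It is not a Codex feature: `step` and `is_done` are hypothetical callbacks standing in for "do one iteration" and "the agent claims it is finished".

```python
def run_loop(step, is_done, min_iter, max_iter):
    """Run step() at least min_iter and at most max_iter times.

    The is_done() signal is only honored once min_iter iterations have run.
    """
    if not 0 <= min_iter <= max_iter:
        raise ValueError("need 0 <= min_iter <= max_iter")
    iterations = 0
    while iterations < max_iter:
        step(iterations)
        iterations += 1
        if iterations >= min_iter and is_done():
            break
    return iterations


# A "lazy" solver that claims to be done right away still has to run 3 iterations.
lazy_runs = run_loop(lambda i: None, lambda: True, min_iter=3, max_iter=10)
# A solver that never claims to be done is cut off at max_iter.
stuck_runs = run_loop(lambda i: None, lambda: False, min_iter=3, max_iter=10)
```

The lower bound guards against early exits, and the upper bound guards against a loop that never converges.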
I knew disappointment was around the corner, the flicker company being the flicker company
The last time I paid them from my pocket was September 2025
It’s supposed to not be their fault, but still…
Experimenting with SOUL.md on gemma4-26b-a4b (running on @DeepInfra)
Interesting that such a lightweight model can already run such a conversation in openclaw harness
@GoogleDeepMind cooked here
don’t focus on the word “loop” so much, focus on “verifiability”
writing a loop is trivial. what makes the loop work is that there is a verifiable goal with a clear signal of success vs failure
verifiable = loopable
What did @karpathy see / was shown?
Why did the benefactor and teacher of the whole ML ecosystem join Anthropic, a company the polar opposite of his image, on the eve of such a powerful model release?
It can't be purely money
Did he reckon that the only way to benefit humanity was to be on the inside, or rather, to not be left outside, of whatever is brewing in there?
It’s been a little bit over 1 year since Anthropic released their Max plans and Claude Sonnet and Opus 4, thus making Claude Code affordable and kickstarting the agentic revolution
Opus 4 was a glimpse into the future. I’ve spent the entire summer swearing at it and typing ultrathink
Today, Fable 5 feels like another step change
I no longer need to type ultrathink. And no longer need to swear at Anthropic models. Only at their marketing team.
Fable burned through my 5 hour quota, and then automatically fell back to usage credits without asking. Org settings I suppose
It was burning through 1 usd every few seconds
It burned through 66 usd before I reacted. Yeah, this is not affordable for anyone with that API pricing, without subsidy/plan
Speaking of loops, I have renamed my implementation-loop skill from earlier this year to autoimplement, because it's shorter
Calling skills that loop auto-x, auto-y makes them more memorable than calling them x-loop, y-loop
But it also increases the number of keystrokes you have to type, before you can tab-complete them
Alas, I still like this more github.com/osolmaz/tools/tree/main/agents/skill…
Ok so there is auto mode, which they introduced back in March, but apparently they are not confident in it yet: it's still in experimental mode and not easily findable in settings
code.claude.com/docs/en/permis…
To YOLO with Fable 5, or not to YOLO, that is the question...
The last time I left, Claude models still had tendencies to rm -rf your home folder or delete stuff without asking first. Is this still a risk?
And from the looks of it, Claude Code still doesn't have Codex's LLM-filtered approval gate feature. Or am I missing something?
Please enlighten your fellow Claude noob 😇
Just in time for a lot of Codex-default developers going back to Claude Code momentarily to try out Fable 5
Here is a CLAUDE.md -> AGENTS.md symlinker that should save you from the hurdles of obstinate Anthropic conventions
It installs a hook that creates the CLAUDE.md symlink automatically as Claude Code traverses directories that contain AGENTS.md. The symlink is automatically ignored by git
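A rough Python sketch of only the symlink step, not the tool's actual hook. The function name is made up, the git-ignore part is left out, and it assumes a filesystem that supports symlinks.

```python
import os
import tempfile
from pathlib import Path


def ensure_claude_symlink(directory):
    """Create CLAUDE.md -> AGENTS.md in directory if that is safe; return True if created."""
    d = Path(directory)
    target = d / "AGENTS.md"
    link = d / "CLAUDE.md"
    if not target.is_file():
        return False  # nothing to point at
    if link.exists() or link.is_symlink():
        return False  # never overwrite an existing CLAUDE.md
    os.symlink("AGENTS.md", link)  # relative link, so it survives moving the repo
    return True


# Small self-contained demo in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "AGENTS.md").write_text("house rules\n")
    created = ensure_claude_symlink(tmp)
    content = Path(tmp, "CLAUDE.md").read_text()
    created_again = ensure_claude_symlink(tmp)
```

Reading CLAUDE.md returns the AGENTS.md content, and a second call is a no-op.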
No need to create CLAUDE.md with reference to AGENTS.md like Anthropic suggests. It just works
github.com/dutifuldev/claude-md-symlinker
TUIs can be easy! look at what right-click does in @herdrdev
refreshing to see something that works with both the keyboard and the mouse. and all this would not have been possible without @ratatui_rs
Question to my ghostty-savvy friends
I am trying to reproduce the Quake style dropdown experience I have been using since 2010 on ghostty on mac here. nothing works quite as well as iterm2 yet
I tried ghostty quick terminal mode. good but it doesn't let me open multiple tabs
I tried cmux because it ships ghostty anyway and is supposed to have more features. but its system-wide hotkey is not playing well with aerospace and window focus
iterm2 worked perfectly. double tap control and I'm in the terminal. is there anything that replicates this UX?
ghzinga can now show multiple PRs/issues in tabs natively, no need to create a new pane in tmux/herdr
also, you can tell your agent to open all the relevant issues/PRs in a side pane using it, and it should work seamlessly
it's the open source maintainer's best friend. life is too short to juggle 100 tabs in chrome, why not have it right next to codex!
Here is the source, I called it ghzinga. You can click click click by default (unlike gh dash, which is still awesome in itself)
For just viewing single issues/PRs
github.com/dutifuldev/ghz…
.@herdrdev is cool. I am tired of doing back and forth with github in the browser, so I created my own clickable PR/issue viewer, inspired by gh-dash
put that in the left pane, codex on the right. saves me so much time
Wait did anyone think otherwise? lol
128 GB unified memory, 20 cores, "Spark" in the name...
I didn't watch the presentation. Maybe because of that I directly inferred that it's the same chip
Thank you @ashleywolf for helping me personally, I really appreciate it! The account was reinstated less than 1 hour after I posted this!
The whole company must be working hard to make github scale in an era of crazy demand and growth!
I am a paying customer of github. I have a team account with 2 seats, one for me, and one for my agent. I have been paying for more than a year now
I do this because I treat my agent's workstation as a lower trust machine, and do not allow merging to main in certain repos
I have been working on a tool that calls github's graphql API. today, my agent's account username:dutifulbob got suspended for no reason
what am I supposed to do now? put my main account on my openclaw instance? I applied to reinstate, it appears it might take weeks to enable it back???
Maybe don't pull such things on your long term paying customers @github??