Entries for June 17, 2026

@onusoz · /2026/06/17 · 04:06 PM

I did some math, and running my Nvidia GB10 workstation (Asus GX10) costs me at most 12~13 USD / month, or 150~160 USD / year. That is a little above half the price of a ChatGPT Plus subscription. For that, I get to run models that can fit in 128 GB of memory.

How I calculated: In Singapore, you can see how much power your apartment uses at half-hourly resolution. We turned off all devices and the A/C while we slept, leaving only the fridge and the GB10 running. From that, we see it uses around 80-100 W while I was running an inference workload overnight. So this is an upper bound. I take it as 90 W. Electricity here costs 0.25 SGD / kWh.

0.09 kW * 0.25 SGD/kWh * 24 h * 30 days * (SGD/USD conversion rate) = 12~13 USD / month = 150~160 USD / year

Local models are getting very good now, small ones roughly around GPT 5.x-mini level. This workstation makes all sorts of workloads possible for me that would otherwise cost a ton on the API.

It is also my always-on workstation that works overnight. I use Codex for my work, and my workstation is always running agents. It never sleeps. I never have to worry about keeping my laptop lid open. I connect and monitor the agents anytime on my phone using mosh and herdr.

We have crossed a threshold. Running local models is cheaper than a big token sub for quite a few workloads already. If you are running a business, that makes a difference.

The localening is here.
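The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in the post: the function name `monthly_cost_usd` is made up for illustration, and the SGD-to-USD rate of 0.78 is an assumption, since the post does not state the rate it used.

```python
def monthly_cost_usd(avg_watts, sgd_per_kwh, sgd_to_usd, days=30):
    """Electricity cost of a device drawing avg_watts around the clock."""
    kwh = avg_watts / 1000 * 24 * days      # energy used over the period
    return kwh * sgd_per_kwh * sgd_to_usd   # convert SGD cost to USD


SGD_TO_USD = 0.78  # assumed exchange rate, not given in the post

monthly = monthly_cost_usd(avg_watts=90, sgd_per_kwh=0.25, sgd_to_usd=SGD_TO_USD)
yearly = monthly * 12

print(f"monthly: {monthly:.2f} USD")
print(f"yearly:  {yearly:.2f} USD")
```

With these inputs the monthly figure lands inside the 12~13 USD range quoted in the post. Swapping in 80 or 100 for `avg_watts` shows how sensitive the result is to the measured draw.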

@onusoz · /2026/06/17 · 04:52 AM

THE LOCALENING IS HERE

@mitchellh · Jun 16, 2026

We've gone really quickly from "local models are dogshit" to "local models are good actually" (like, a 12 month window from A to B). I don't think they're actually good ENOUGH yet. We need an Opus 4.5 quality local model. When that happens, I think the world will spill over. Opus 4.5 is/was amazing, and is more than good enough for almost all tasks still as long as you pair with a frontier-level planner/judge. It'll still require a hugely expensive machine to run it, I'm sure, like a $5K or more laptop or Mac Studio. But that's going to be pennies compared to the API costs, plus all the benefits of guaranteed privacy and so on.

@onusoz · /2026/06/17 · 03:55 AM

New agent benchmark alert: SkillsBench

@xdotli · Jun 16, 2026

A big pain point in using AI benchmarks is encountering errors after a benchmark's first release. Today, we're releasing SkillsBench 1.1, the first benchmark for how well AI agents use skills, now audited end to end and verified error-free. Prof. @dawnsongtweets joins 1.1 as advising author.

We worked through every task with several frontier labs to eliminate the errors in the previous version. We also added new tasks, moved the ones with external dependencies into a separate set so the core suite runs clean, and expanded coverage to more models.

Capability is climbing fast. The best with-skills resolution rate rose from ~36% (Claude Sonnet 4.5, Sep 2025) to 67% (GPT-5.5, May 2026), about +1.9 points per month. The frontier is hill-climbing SkillsBench fast.

The right skills still matter. Across the fleet, curated skills lift resolution rate by +16.6 points on average (33.9% → 50.5%), and by as much as +25.7 points for a single model. The top configuration is GPT-5.5 on OpenHands at 67.3%.

By popular demand (thx Nate @cursor_ai), we're now tracking skills invocation: how often an agent actually uses the skills it's given. Recent flagship configurations invoke them 90–99% of the time (Codex 99%, OpenHands + GPT-5.5 92%, Gemini CLI 90%), versus roughly 50% for older setups.

Also new in 1.1: @OpenHands joins as a fourth harness, alongside Claude Code, Codex, and Gemini CLI; a rebuilt leaderboard with refined categories, subdomain skill rankings, and Skill Lift; and native task.md on BenchFlow, with multi-scene environments and rollout branching. We also partnered with @k_dense_ai to add scientific skills to some science tasks.

One implication for deployment: skills can substitute for scale. GLM 5.1 with skills (58.4%) outperforms Opus 4.8 without (45.7%). A smaller model with the right procedural knowledge can beat a larger one running without it.

Huge thanks to @nick_kango @ivanleomk @kaggle @GoogleDeepMind for hosting a launch event with us. Thanks to everyone who came on May 27! Also thanks to our partners @gneubig @OpenHandsDev @ivanburazin @daytonaio @jackminong @johannes_hage @PrimeIntellect @TimothyKassis @k_dense_ai for providing support in credits, compute, and skills.

SkillsBench live leaderboard will also come to @ValsAI. Many people have told us they use SkillsBench as an index to measure models' agentic capability over diverse and high GDP value domains. Great work on Valkyrie as well! @ Jarett @nikilravi @langstonnashold @RayanKrishnan

SkillsBench is fully open-source. Explore the leaderboard and tasks, read the docs, or contribute your own skill set or harness and join the leaderboard. 🧵
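The lift figures quoted in the thread can be checked with a small sketch. It assumes that "Skill Lift" means the with-skills resolution rate minus the without-skills rate, in percentage points. That reading is inferred from the "33.9% → 50.5%" example, not from a SkillsBench definition, and the helper name `skill_lift` is hypothetical.

```python
def skill_lift(with_skills_pct, without_skills_pct):
    """Difference in resolution rate, in percentage points (assumed definition)."""
    return with_skills_pct - without_skills_pct


# Fleet-wide averages quoted in the thread.
fleet_lift = skill_lift(50.5, 33.9)

# Smaller model with skills vs. larger model without skills.
gap = skill_lift(58.4, 45.7)

print(f"fleet lift: {fleet_lift:+.1f} points")
print(f"GLM 5.1 with skills vs Opus 4.8 without: {gap:+.1f} points")
```

The first value matches the +16.6 points reported in the thread. The second is not a lift in the strict sense, since it compares two different models, but it shows the size of the gap behind the "skills can substitute for scale" claim.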
