If you are interested in running such demos, look into --demo mode in my local model swiss army knife localpi
https://t.co/LyjwJWDjmi
Thank you @googlegemma for the shoutout
New blog post: Using local models for agentic zero-shot classification, for real-time, high-frequency triage
If you have 128 GB of memory for models (a DGX Spark, like I do, for example), you can build a real-time classifier and notifier for yourself that classifies more than 20 items per minute, using mid-sized @googlegemma and @Alibaba_Qwen models, with 200-300+ output tok/s of aggregate throughput
For example, it can process new tweets on Twitter, issues/PRs on GitHub, and messages on Telegram and Discord, in real time
Over the past few weeks, I have built one for myself, to filter and get notified about local model related issues on the OpenClaw repo
I initially thought gemma-4-e4b would give me the best tradeoff
I was wrong. I learned that if one already has enough memory, one should not bother with <10b models like gemma-4 e4b or e2b. Zero-shot precision and recall were much higher with gemma-4-26b-a4b, whereas the smaller e4b needed significant prompt optimization and still did not perform nearly as well
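For readers who want to run the same kind of comparison, here is a minimal sketch of how precision and recall can be computed for a single label. The function name and the toy labels are made up for illustration; they are not taken from the benchmark dataset.

```python
def precision_recall(gold: list[bool], pred: list[bool]) -> tuple[float, float]:
    """Compute precision and recall for one binary label (e.g. "is local_models")."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)        # correctly flagged
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)    # flagged but wrong
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)    # missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data, invented for this example: 5 items, 3 of them truly relevant.
gold = [True, True, False, False, True]
pred = [True, False, True, False, True]
precision, recall = precision_recall(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With these toy labels there are 2 true positives, 1 false positive and 1 miss, so both numbers come out at 2/3.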
To provide more context to the model, I created a restricted bash-like shell, called reposhell. In that shell, it can run read-only commands to ls/find/grep/cat openclaw source code, but only that. When the PR description/diffs are not clear enough to categorize it, the agent reads the code to figure it out
The shell is restricted because small models can get prompt injected, and I need to make sure that no one can harm my setup by creating a malicious issue or PR in the openclaw repo
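To make the idea of a read-only shell concrete, here is a minimal sketch of a command allowlist. It is not reposhell's actual implementation: `validate`, the blocked `find` flags and the path check are all assumptions made for illustration, and a real sandbox would need more than this (for example, flags that take file arguments are not inspected here).

```python
import shlex
from pathlib import Path

# Hypothetical allowlist, based only on the commands named in the thread.
ALLOWED_COMMANDS = {"ls", "find", "grep", "cat"}
# find actions that can execute programs, delete files or write files.
BLOCKED_FIND_FLAGS = {"-exec", "-execdir", "-ok", "-okdir", "-delete",
                      "-fprint", "-fprint0", "-fprintf", "-fls"}
# Reject anything that could chain, pipe, redirect or substitute commands.
SHELL_METACHARACTERS = set(";|&<>`$\n")


def validate(command: str, repo_root: str) -> list[str]:
    """Return argv for an allowed read-only command, or raise ValueError."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        raise ValueError("shell metacharacters are not allowed")
    argv = shlex.split(command)
    if not argv or argv[0] not in ALLOWED_COMMANDS:
        raise ValueError("command is not in the allowlist")
    if argv[0] == "find" and BLOCKED_FIND_FLAGS.intersection(argv):
        raise ValueError("this find flag can execute or write")
    root = Path(repo_root).resolve()
    for arg in argv[1:]:
        if arg.startswith("-"):
            continue  # flags are skipped in this sketch
        # Conservatively treat every other argument as a path inside the repo.
        if not (root / arg).resolve().is_relative_to(root):
            raise ValueError("path escapes the repository")
    return argv


print(validate("grep -rn ollama src", "."))
```

The returned argv would then be run without a shell (for example with `subprocess.run(argv, cwd=repo_root)`), so the model's text is never interpreted by bash itself.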
I found that for specific systems like this, it is very convenient to extend and bundle Pi. You can create agentic CLI tools that work fully locally and for free, and keep that separate from your main pi coding setup. localpager-agent has its own session dir and tools, and I ensure that it will run local models in a secure way by isolating it from my main pi setup
Once localpager-agent labels a PR/issue as local_models or a related label, I automatically receive it as a notification on Discord
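The label-to-notification step can be sketched roughly like this. This is an assumption about the shape of the logic, not the project's code: `build_notification`, the watched label set and the example item are hypothetical. The payload uses the `content` field that Discord webhooks accept, but whether the real setup uses a webhook or a bot is not stated here.

```python
from typing import Optional

# Hypothetical set of labels worth a notification.
WATCHED_LABELS = {"local_models"}
# Discord limits the "content" field of a message to 2000 characters.
MAX_CONTENT_LENGTH = 2000


def build_notification(labels: set[str], title: str, url: str) -> Optional[dict]:
    """Return a Discord-style message payload if any watched label matched, else None."""
    matched = sorted(labels & WATCHED_LABELS)
    if not matched:
        return None  # not relevant, stay silent
    content = f"[{', '.join(matched)}] {title}\n{url}"
    return {"content": content[:MAX_CONTENT_LENGTH]}


# Made-up item for illustration.
payload = build_notification(
    {"local_models", "bug"},
    "Example issue title",
    "https://example.com/issues/1",
)
print(payload)
```

The returned dict could then be sent as JSON in a POST request to a webhook URL; items with no watched label produce no payload at all.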
The whole implementation is fully open source and MIT licensed, alongside the dataset we used to benchmark the performance
I believe zero-shot agentic classification running on local hardware will find many use cases across a wide variety of business applications, like news gathering, open source software development, customer support, content moderation, sales and so on
Agents increase the amount of information produced in a lot of systems, and hence we will need to set up cheap ways to wrangle all that information
In times where governments can cut off access to SOTA models on a whim, it is more important than ever to build your business on open models and if possible, run them on your own hardware!
Big thanks to @evalstate and @ben_burtenshaw for their valuable feedback, especially for helping me evaluate this more rigorously! One takeaway is that categorizing contributions in an open source repo is a *hard* problem, and that it is not trivial to reliably create a golden dataset with LLMs for evaluation purposes
Read more here: https://t.co/nHppWUjaqO
One sweep over 100 samples takes around 4 hours.
Next up: cross reference ground truth with predictions from hf-mem by @alvarobartt
https://t.co/45WN0jtEQj
gpt5.5 and most other models are very bad at one-shotting nice data models
gpt5.5 also has this annoying property that once it decides on a schema (or any design), it's very hard to trigger thinking again. and if you ask it to "rewrite from scratch", it will create something even more ridiculous
To solve this problem, I have built a meta-harness over codex just for simplifying slop data models called schemator (work in progress)
Basic idea: it mimics what I myself do while I am designing a schema: scrutinize and question each field one by one
It starts a fresh codex session for each field with a fixed prompt like "Try to come up with the most Lindy data model" + a prompt for side notes
It does that with a fresh context for each field, so that they are independent from each other. At the end of a review run over a field, the reviewer can propose to keep, rename or remove the field
When all fields are reviewed once, that makes one iteration. This is then looped until the review results stabilize, i.e. the reviewers no longer propose any changes
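The review loop described above can be sketched as a fixed-point iteration. This is a toy illustration, not schemator's code: `simplify`, the decision tuples and `toy_review` are hypothetical, and the reviewer here is a plain function standing in for a fresh codex session per field. Passing each reviewer the schema as it stood at the start of the iteration is also an assumption.

```python
from typing import Callable

# A decision is ("keep",), ("rename", new_name) or ("remove",).
Decision = tuple
Reviewer = Callable[[str, list[str]], Decision]


def simplify(fields: list[str], review: Reviewer, max_iters: int = 10) -> list[str]:
    """Review every field once per iteration until no reviewer proposes a change."""
    for _ in range(max_iters):
        changed = False
        next_fields = []
        for name in fields:
            # Each call is independent of the others, like a fresh session per field.
            decision = review(name, list(fields))
            if decision[0] == "remove":
                changed = True
            elif decision[0] == "rename" and decision[1] != name:
                next_fields.append(decision[1])
                changed = True
            else:
                next_fields.append(name)
        fields = next_fields
        if not changed:
            break  # results have stabilized
    return fields


def toy_review(name: str, schema: list[str]) -> Decision:
    """Stand-in reviewer with hard-coded opinions, for illustration only."""
    if name == "legacy_flag":
        return ("remove",)
    if name == "usr_nm":
        return ("rename", "user_name")
    return ("keep",)


print(simplify(["id", "usr_nm", "legacy_flag"], toy_review))
# prints ['id', 'user_name']
```

The `max_iters` cap is there because nothing guarantees that LLM reviewers converge; two reviewers could keep renaming a field back and forth.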
I get better results by just asking my agent to "use schemator on this" after it creates a JSON schema or SQL table
Give it a try if you have codex! It has a skill, so it should be easy for an agent to figure out how to use it