Who is running local models on GPUs with OpenClaw?
I have started benchmarking different models this week. I am working on improving the model selection and switching UX on OpenClaw, e.g. I run
/model vllm/gemma-e4b
to switch the model in a channel, and a model controller automatically loads it into memory and gets it ready, or returns an insufficient-memory error if there is not enough capacity, e.g. when you already have multiple models loaded in parallel
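Roughly, the controller's job is a capacity check before the load. Here is a minimal sketch of that idea (hypothetical names and an illustrative footprint number, not the actual OpenClaw code):

```python
import torch

# Rough bf16 weight footprints in GiB; illustrative placeholders, not measurements.
MODEL_FOOTPRINT_GIB = {
    "vllm/gemma-e4b": 16.0,
}

class InsufficientMemoryError(RuntimeError):
    pass

def switch_model(name: str, device: int = 0) -> None:
    # Fail fast with a readable error instead of letting the backend OOM mid-load.
    free_bytes, _total = torch.cuda.mem_get_info(device)
    needed = MODEL_FOOTPRINT_GIB.get(name, 0.0) * 1024**3
    if needed > free_bytes:
        raise InsufficientMemoryError(
            f"{name} needs ~{needed / 1024**3:.1f} GiB, "
            f"only {free_bytes / 1024**3:.1f} GiB free on GPU {device}"
        )
    # ...hand off to the backend (vLLM, llama.cpp, etc.) to actually load the weights
```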
I am going to try llama-swap, LM Studio and Ollama for this next and compare them. There are a ton of model variants, weight formats and quantizations that need benchmarking
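One thing that should make the comparison easy: all three expose an OpenAI-compatible endpoint and key their on-demand loading/swapping off the `model` field of the request, so a single client can drive whichever backend is running. A tiny example (base URL and model name are placeholders for your local setup):

```python
from openai import OpenAI

# Point at whichever backend is running locally (llama-swap, LM Studio or Ollama).
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="gemma-e4b",  # the backend loads or swaps to this model on demand
    messages=[{"role": "user", "content": "Say hi in five words."}],
)
print(resp.choices[0].message.content)
```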
I have been using unquantized original safetensors so far, which already lets me run ~5 parallel generations on my hardware
So if I try LM Studio, I would rather use the bf16 ggml-org/gemma-4-E4B-it-GGUF than anything smaller, because there is no point in nerfing an already smol model if your hardware can run 5 parallel sessions on the unquantized version
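For context, the back-of-the-envelope VRAM math behind that choice, assuming an ~8B-param model and a made-up per-session KV-cache cost (placeholders, not benchmarks):

```python
def weights_gib(params_b: float, bytes_per_param: float) -> float:
    # weight memory = parameter count * bytes per parameter
    return params_b * 1e9 * bytes_per_param / 1024**3

bf16 = weights_gib(params_b=8.0, bytes_per_param=2.0)   # unquantized bf16
q4   = weights_gib(params_b=8.0, bytes_per_param=0.55)  # ~4.4 bits/param incl. overhead

kv_per_session_gib = 0.5   # hypothetical KV-cache cost per parallel session
sessions = 5

print(f"bf16: ~{bf16 + sessions * kv_per_session_gib:.1f} GiB for {sessions} sessions")
print(f"q4:   ~{q4 + sessions * kv_per_session_gib:.1f} GiB for {sessions} sessions")
# If the bf16 weights plus 5 sessions already fit, quantizing only buys headroom.
```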
Will also release vibe reports and benchmarks on all this with @mervenoyann later this week
I would like to hear your thoughts if you have already tried these models on OpenClaw