I think I just broke local AI inferencing
Qwen3.5-35B-A3B AWQ
1,010,000 tokens context (yeah ONE Million)
4,350,080 tokens KV cache (FOUR POINT THREE Million, not a typo)
TurboQuant 3.5
All running on a USB-charger-sized GB10 GPU.
Now the fun part, vLLM access log numbers from a cold start run:
~350 tokens/sec generation throughput peak
~260 tokens/sec sustained under load
64 concurrent requests handled
0.4% to 6.6% KV cache utilization → meaning this thing is barely warming up
Prefix cache hit rate: 0% (no tricks, raw performance)
My system is not optimized. It’s heavily underutilized.
Make sure you own your AI. AI in the cloud is not aligned with you; it’s aligned with the company that owns it.
1 / 3
Shawn Kahalewai Reilly Did you move on from the Halo Strix? Mar 26
Marcel C. Any suggestions on making this on m5 pro with turboquant enabled? Does vllm work out of the box? Mar 26