Google just published TurboQuant https://lnkd.in/ecbpBSpB, a model compression technique that can quantize the transformer's key-value cache to just 3 bits without requiring any training or fine-tuning and without compromising model accuracy, all while running faster than the original LLMs.

As you might already know, LLMs store intermediate computations in something called a key-value cache — essentially a running memory of what the model has processed so far — and this cache grows linearly with the length of the input, eating up GPU memory fast.
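
To put rough numbers on that growth, here is a back-of-the-envelope sketch in Python. The model shape (32 layers, 32 key-value heads, head dimension 128) is an illustrative assumption, not a figure from the TurboQuant announcement, and the 3-bit estimate ignores any extra storage a real quantizer might need for scales or other metadata.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128,
                   bits_per_value=16, batch_size=1):
    """Estimate KV cache size in bytes.

    All default shape values are hypothetical, chosen only for illustration.
    """
    # The factor of 2 covers one key vector and one value vector
    # per token, per layer, per KV head.
    n_values = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size
    return n_values * bits_per_value / 8


if __name__ == "__main__":
    gib = 1024 ** 3
    for seq_len in (1024, 8192, 32768):
        fp16 = kv_cache_bytes(seq_len, bits_per_value=16)
        q3 = kv_cache_bytes(seq_len, bits_per_value=3)
        print(f"{seq_len:>6} tokens: {fp16 / gib:.2f} GiB at 16 bits, "
              f"{q3 / gib:.2f} GiB at 3 bits")
```

Under these assumptions the cache size scales in direct proportion to the token count, and going from 16 bits to 3 bits per stored value shrinks it to 3/16 of its original size, a bit more than a 5x reduction.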