# Google just published TurboQuant https://lnkd.in/ecbpBSpB, a model compressio...
Canonical: https://social-archive.org/nbluemer/hsQ6r24xt4
Original URL: https://www.linkedin.com/posts/andriyburkov_google-just-published-turboquant-https-share-7442701043118866433--CDY/
Author: Andriy Burkov
Platform: linkedin
## Content
Google just published TurboQuant https://lnkd.in/ecbpBSpB, a model compression technique that can quantize the transformer's key-value cache to just 3 bits without requiring training or fine-tuning and without any compromise in model accuracy, all while achieving a faster runtime than the original LLMs.

As you might already know, LLMs store intermediate computations in something called a key-value cache (essentially a running memory of what the model has processed so far), and this cache grows linearly with the length of the input, eating up GPU memory fast. The standard fix is quantization: representing each number with fewer bits (say 4 instead of 16), which shrinks the cache but introduces error. The problem is that most quantization methods need to store extra correction constants for every small block of data, and those constants themselves take up 1–2 bits per number, which is a lot when you're trying to get down to 3 or 4 bits total.

TurboQuant gets around this by combining two ideas. The first is PolarQuant, which converts vectors into polar coordinates (radius and angle instead of x/y/z), where the data distribution is predictable enough that per-block correction constants are no longer needed. The second is QJL, a 1-bit scheme that needs only a single sign bit per number with no extra correction constants stored alongside, so the 1 bit is the entire cost. PolarQuant does the heavy lifting as the first compression stage, and QJL is then applied to the small residual error PolarQuant leaves behind. The goal of that second stage is not to eliminate the residual but to remove its systematic bias, so it doesn't consistently skew attention scores.

The papers report compressing the cache to 3 bits **with no measurable accuracy loss** on standard benchmarks and up to 8x speedup on attention computation, and the methods come with actual theoretical guarantees.

All three papers are available on ChapterPal to read with an AI tutor. Suggested reading order: QJL https://lnkd.in/ehcFhGQU first, then PolarQuant https://lnkd.in/eq4xybJr, then TurboQuant. The first two are independent building blocks, and TurboQuant https://lnkd.in/ehcFhGQU combines them.
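
To make the "1–2 bits per number" overhead concrete, here is a back-of-the-envelope sketch. The block sizes (16 and 32) and the two 16-bit constants per block (a scale and a zero point) are illustrative assumptions, not figures from the papers, and `bits_per_number` is a hypothetical helper.

```python
def bits_per_number(payload_bits, block_size, constants_per_block=2, constant_bits=16):
    """Effective storage cost per cached number, counting per-block constants.

    Assumption: each block stores `constants_per_block` constants of
    `constant_bits` bits each (e.g. a scale and a zero point).
    """
    overhead = constants_per_block * constant_bits / block_size
    return payload_bits + overhead


for block_size in (16, 32):
    total = bits_per_number(3, block_size)
    overhead = total - 3
    print(f"block of {block_size}: {overhead:.0f} overhead bits, {total:.0f} bits per number in total")
```

Under these assumptions a nominal 3-bit scheme really costs 4 to 5 bits per number, which is why a method that stores no per-block constants matters at this bit width.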

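
The sign-bit idea can also be sketched in a few lines. This is a toy illustration in the spirit of QJL, not the paper's implementation: it assumes Gaussian random projections, keeps the query in full precision, and stores one norm per key vector (an assumption of this sketch). The names `sign_sketch` and `estimate_dot` are hypothetical, and the number of projections is set far higher than a real cache would use so that the estimate is visibly tight.

```python
import math
import random


def sign_sketch(key, projections):
    """Keep only the sign (+1 or -1) of each random projection of the key."""
    return [1 if sum(p * x for p, x in zip(row, key)) >= 0 else -1
            for row in projections]


def estimate_dot(query, key_bits, key_norm, projections):
    """Estimate <query, key> from the key's sign bits and its norm.

    For Gaussian rows s, E[(s.q) * sign(s.k)] = sqrt(2/pi) * <q, k> / ||k||,
    so rescaling the average by sqrt(pi/2) * ||k|| gives an unbiased estimate.
    """
    total = sum(bit * sum(p * x for p, x in zip(row, query))
                for row, bit in zip(projections, key_bits))
    return math.sqrt(math.pi / 2) * key_norm * total / len(projections)


rng = random.Random(0)
dim, num_projections = 16, 4000  # illustrative sizes, not from the papers
projections = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_projections)]
query = [rng.gauss(0, 1) for _ in range(dim)]
key = [rng.gauss(0, 1) for _ in range(dim)]

exact = sum(q * k for q, k in zip(query, key))
key_norm = math.sqrt(sum(k * k for k in key))
key_bits = sign_sketch(key, projections)
approx = estimate_dot(query, key_bits, key_norm, projections)
print(f"exact={exact:.3f} approx={approx:.3f}")
```

The individual sign bits throw away almost everything about the key, yet the averaged estimate is centered on the true dot product. That lack of systematic bias is the property the post highlights for attention scores.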