A tricky LLM interview question:
You're serving a reasoning model on vLLM, and it keeps running out of GPU memory on long traces.
So you add KV cache compression and evict 90% of the cached tokens.
VRAM usage stays as is and GPU still runs out of memory.
Why?
(answer below)
Evicting 90% of the KV cache can free almost none of the memory it was using.
This sounds counterintuitive, but it follows directly from how production servers store the cache today.
The KV cache grows with every token a model generates. Each token appends its key and value vectors across every layer, and nothing is freed while generation continues.
This is the dominant memory cost for reasoning models.
If a 32K-token CoT caches ~32K tokens of KV vectors, a Qwen3-32B with 4-bit weights will run out-of-memory around 24K tokens on a 24GB GPU.
One obvious solution is to keep the important tokens and drop the rest, since attention is sparse enough to allow it.
But this does not solve the memory problem yet.
The reason is paged attention, which is the memory manager behind vLLM and most production servers.
Under the hood, it splits GPU memory into fixed physical blocks, each one holds the KV for about 16 tokens.
This block returns to the allocator only when every slot inside it is empty.
Since the eviction logic selects tokens by importance, and such tokens are scattered across blocks...
...so despite eviction, almost every block is left with at least some survivor tokens.
For instance, if the logic evicts 14k of 16k tokens across 1,000 blocks, most likely every block will still have a token.
This means the allocator frees almost nothing.
Placing the new tokens into those freed slots is not ideal because it breaks the cache's layout.
Say token 16,001 arrives, and it's placed in the slot the 40th token used to hold. The cache now reads position 38, then 16,001, then 41, so the cache is no longer in token order.
Attention can still compute the right answer from that, but only if every slot now carries a separate note recording which position it actually holds.
This introduces another bookkeeping cost that an in-order layout inherently avoids.
So the cache is logically 90% smaller and still physically the same size. Many compression results miss this because they measure on pre-allocated contiguous tensors rather than a paged server.
There's another problem.
Eviction methods pick which tokens to keep by looking at the attention scores themselves (as expected).
But fast attention kernels used in production, like FlashAttention, never save those scores.
They compute attention in small pieces and throw the full score grid away as they go, which is also why they're fast.
So the exact signal eviction methods need isn't available in memory. The workaround is to fall back to eager attention and build the full matrix, which gives up the speed FlashAttention was there to provide.
NVIDIA published a method called TriAttention to solve both these problems.
It never needs attention scores. Instead, it scores tokens from the geometry of the model's key and query vectors before RoPE is applied, where those vectors sit in stable clusters.
For the memory problem, it runs a compaction pass every 128 decoded tokens.
The surviving tokens slide forward to close the holes eviction creates, so whole blocks empty out and return to the allocator while the cache stays in token order.
On long reasoning traces, the approach matches full-attention accuracy while decoding 2.5x faster and using 10.7x less KV memory.
KV cache compression is a big infrastructure problem. The number that decides whether it works is the count of freed blocks, not the count of evicted tokens.
You can find the NVIDIA write-up here: https://research.nvidia.com/labs/eai/blogs/kv-cache-compression-and-its-infra-problems/
I wrote a first-principles breakdown of how the KV cache works. It walks through why the model stores keys and values at all, why the cache grows with every token, and a comparison of LLM generation speed with and without KV caching.
Read it below.
LLM 인터뷰 질문 및 답변: KV 캐시 압축의 문제점
vLLM에서 추론 모델을 서비스할 때 긴 추적(Trace)에서 GPU 메모리가 부족해지는 문제를 해결하기 위해 KV 캐시 압축을 적용했음에도 불구하고 VRAM 사용량이 줄지 않는 이유는 다음과 같습니다.
해결책: NVIDIA의 TriAttention
NVIDIA는 이러한 문제를 해결하기 위해 TriAttention이라는 방법을 제안했습니다.