Avi Chawla on x

yena shared this post · 2h ago

Avi Chawla

13h ago · archived 2h ago

A tricky LLM interview question:

You're serving a reasoning model on vLLM, and it keeps running out of GPU memory on long traces.

So you add KV cache compression and evict 90% of the cached tokens.

VRAM usage stays as is and GPU still runs out of memory.

Why?

(answer below)

Evicting 90% of the KV cache can free almost none of the memory it was using.

This sounds counterintuitive, but it follows directly from how production servers store the cache today.

The KV cache grows with every token a model generates. Each token appends its key and value vectors across every layer, and nothing is freed while generation continues.

This is the dominant memory cost for reasoning models.

If a 32K-token CoT caches ~32K tokens of KV vectors, a Qwen3-32B with 4-bit weights will run out-of-memory around 24K tokens on a 24GB GPU.

One obvious solution is to keep the important tokens and drop the rest, since attention is sparse enough to allow it.

But this does not solve the memory problem yet.

The reason is paged attention, which is the memory manager behind vLLM and most production servers.

Under the hood, it splits GPU memory into fixed physical blocks, each one holds the KV for about 16 tokens.

This block returns to the allocator only when every slot inside it is empty.

Since the eviction logic selects tokens by importance, and such tokens are scattered across blocks...

...so despite eviction, almost every block is left with at least some survivor tokens.

For instance, if the logic evicts 14k of 16k tokens across 1,000 blocks, most likely every block will still have a token.

This means the allocator frees almost nothing.

Placing the new tokens into those freed slots is not ideal because it breaks the cache's layout.

Say token 16,001 arrives, and it's placed in the slot the 40th token used to hold. The cache now reads position 38, then 16,001, then 41, so the cache is no longer in token order.

Attention can still compute the right answer from that, but only if every slot now carries a separate note recording which position it actually holds.

This introduces another bookkeeping cost that an in-order layout inherently avoids.

So the cache is logically 90% smaller and still physically the same size. Many compression results miss this because they measure on pre-allocated contiguous tensors rather than a paged server.

There's another problem.

Eviction methods pick which tokens to keep by looking at the attention scores themselves (as expected).

But fast attention kernels used in production, like FlashAttention, never save those scores.

They compute attention in small pieces and throw the full score grid away as they go, which is also why they're fast.

So the exact signal eviction methods need isn't available in memory. The workaround is to fall back to eager attention and build the full matrix, which gives up the speed FlashAttention was there to provide.

NVIDIA published a method called TriAttention to solve both these problems.

It never needs attention scores. Instead, it scores tokens from the geometry of the model's key and query vectors before RoPE is applied, where those vectors sit in stable clusters.

For the memory problem, it runs a compaction pass every 128 decoded tokens.

The surviving tokens slide forward to close the holes eviction creates, so whole blocks empty out and return to the allocator while the cache stays in token order.

On long reasoning traces, the approach matches full-attention accuracy while decoding 2.5x faster and using 10.7x less KV memory.

KV cache compression is a big infrastructure problem. The number that decides whether it works is the count of freed blocks, not the count of evicted tokens.

You can find the NVIDIA write-up here: https://research.nvidia.com/labs/eai/blogs/kv-cache-compression-and-its-infra-problems/

I wrote a first-principles breakdown of how the KV cache works. It walks through why the model stores keys and values at all, why the cache grows with every token, and a comparison of LLM generation speed with and without KV caching.

Read it below.

GLM 4.7 Flash · Summary · 2h ago

LLM 인터뷰 질문 및 답변: KV 캐시 압축의 문제점

vLLM에서 추론 모델을 서비스할 때 긴 추적(Trace)에서 GPU 메모리가 부족해지는 문제를 해결하기 위해 KV 캐시 압축을 적용했음에도 불구하고 VRAM 사용량이 줄지 않는 이유는 다음과 같습니다.

페이지 어텐션의 특성: vLLM과 같은 프로덕션 서버는 GPU 메모리를 고정된 물리적 블록으로 나눕니다. 각 블록은 약 16개의 토큰을 저장하며, 블록 내의 모든 슬롯이 비어야만 할당자가 블록을 해제합니다.
토큰 분산으로 인한 해제 실패: 중요한 토큰만 남기고 나머지를 삭제하면, 삭제된 토큰이 블록에 흩어져 있어 대부분의 블록에 여전히 살아남은 토큰이 남게 됩니다. 이로 인해 할당자는 거의 메모리를 해제하지 못합니다.
순서 유지의 비용: 빈 슬롯에 새 토큰을 배치하면 캐시의 순서가 깨지므로, 각 슬롯이 실제 위치를 기록하는 별도의 관리 비용이 발생합니다.
속도 저하: FlashAttention과 같은 빠른 커널은 어텐션 점수를 저장하지 않으므로, 정확한 토큰 삭제 신호를 사용할 수 없어 느린 'eager attention'으로 대체해야 합니다.

해결책: NVIDIA의 TriAttention

NVIDIA는 이러한 문제를 해결하기 위해 TriAttention이라는 방법을 제안했습니다.

RoPE 적용 전 계산: 어텐션 점수가 필요 없으며, RoPE 적용 전 키와 쿼리 벡터의 기하학적 구조를 기반으로 토큰을 평가합니다.
압축 및 정렬: 128개의 토큰을 디코딩할 때마다 압축 패스를 실행하여 살아남은 토큰을 앞으로 이동시켜 빈 블록을 확보합니다.
효과: 긴 추론 추적에서 전체 어텐션 정확도를 유지하면서 디코딩 속도는 2.5배 빨라지고 KV 메모리 사용량은 10.7배 줄어듭니다.