# UC Berkeley researchers solved the biggest bottleneck in LLM inference. 85K s...
Canonical: https://social-archive.org/yena/lGd1KjQeau
Original URL: https://x.com/techNmak/status/2072311649374192113
Author: Tech with Mak
Platform: x
## Content
UC Berkeley researchers solved the biggest bottleneck in LLM inference. 85K stars. 2,861 contributors.

It's called vLLM, and it completely rethinks how memory is allocated during generation. Traditional serving wastes VRAM due to fragmented key-value states. vLLM fixes this with PagedAttention. It efficiently manages attention key and value memory just like an operating system manages virtual memory.

What this architecture delivers:

→ 24x better memory efficiency
→ Continuous batching of incoming requests
→ Fast model execution with CUDA/HIP graphs
→ State-of-the-art serving throughput

pip install vllm

Open-source. Apache-2.0 license.

If you're serving LLMs without this, you're burning money.
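
To make the virtual-memory analogy in the post concrete, here is a rough sketch of the paging idea in plain Python. It is a toy model and not vLLM's code: `BlockAllocator`, `Sequence`, `wasted_slots`, and the block size of 16 tokens are all hypothetical names and values chosen for illustration. Each sequence keeps a block table that maps logical block positions to physical block ids, so its key-value cache does not need one contiguous region and at most one partly filled block is left unused.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block; an illustrative value, not vLLM's setting


class BlockAllocator:
    """Toy pool of fixed-size physical KV-cache blocks (hypothetical helper)."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks left")
        return self.free.pop()

    def release(self, block_id: int) -> None:
        self.free.append(block_id)


@dataclass
class Sequence:
    """One request: a block table maps logical block index -> physical block id."""

    block_table: list = field(default_factory=list)
    num_tokens: int = 0

    def append_token(self, allocator: BlockAllocator) -> None:
        # A new physical block is taken only when the last one is full,
        # so memory grows on demand instead of being reserved up front.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1

    def locate(self, token_idx: int) -> tuple:
        # Translate a logical token position into (physical block, offset),
        # much like a page-table lookup.
        if not 0 <= token_idx < self.num_tokens:
            raise IndexError(token_idx)
        return self.block_table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE

    def free(self, allocator: BlockAllocator) -> None:
        # Return every block to the pool so other requests can reuse them.
        for block_id in self.block_table:
            allocator.release(block_id)
        self.block_table.clear()
        self.num_tokens = 0


def wasted_slots(seq: Sequence) -> int:
    """Unused token slots; only the last block can be partly empty."""
    return len(seq.block_table) * BLOCK_SIZE - seq.num_tokens
```

In this sketch, a sequence that has generated 20 tokens holds two blocks and leaves 12 slots unused, whereas reserving a contiguous region sized for the maximum possible length would tie up far more. The real system also has to handle block sharing, preemption, and the GPU kernels that read through the block table, none of which is modeled here.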
