PagedAttention: Fixing the Real Bottleneck in LLM Serving

2025-09-24

Here is the question:

“What’s the real bottleneck in LLM serving throughput? And how can PagedAttention help?”

Most people say compute or FLOPs.
But the truth is: the bottleneck is memory, specifically the KV cache.

This post will cover:

  1. The memory wall of LLM serving
  2. How PagedAttention works
  3. Benchmarks: Hugging Face vs vLLM
  4. Memory usage analysis
  5. Deployment & engineering insights
  6. Block size trade-offs
  7. Future directions


1. The Memory Wall of LLM Serving

LLM serving isn’t compute-bound—it’s memory-bound.
Here’s a typical GPU memory breakdown:

  • 65% → Model weights
  • 30% → KV cache
  • 5% → Activations

The KV cache (keys and values from past tokens) grows linearly with sequence length and batch size.

Formula for KV cache memory:

Memory_KV ≈ num_layers × num_heads × head_dim × seq_len × batch_size × 2 (K and V) × bytes_per_element

Example: OPT-6.7B

  • 32 layers, 32 heads, head_dim = 128
  • seq_len = 4096, batch = 16, fp16 (2 bytes per element)
  • Memory_KV ≈ 34 GB

That’s just for KV cache, not weights!
This is why GPUs “run out of memory” far before compute maxes out.
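
As a sanity check, here is that back-of-the-envelope arithmetic in Python (assuming fp16, i.e. 2 bytes per element):

# KV cache size estimate for OPT-6.7B in fp16, matching the formula above
num_layers, num_heads, head_dim = 32, 32, 128
seq_len, batch_size = 4096, 16
kv_factor = 2          # keys and values
bytes_per_element = 2  # fp16

memory_kv = (num_layers * num_heads * head_dim
             * seq_len * batch_size * kv_factor * bytes_per_element)
print(f"KV cache: {memory_kv / 1e9:.1f} GB")  # ~34.4 GB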



2. The PagedAttention Breakthrough

vLLM introduced PagedAttention, inspired by operating systems.

Just like an OS uses virtual memory with pages, PagedAttention splits the KV cache into fixed-size blocks that don’t need to be contiguous.

  • Memory fragmentation drops to near zero
  • Allocations are dynamic and block-based
  • GPU memory is used far more efficiently

How It Works

  1. KV cache divided into blocks (pages)
  2. Attention computed block by block
  3. Logical → physical block mapping via a block table
  4. Free blocks instantly when a sequence finishes

No more over-provisioning or wasted slots.
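
To make the block-table idea concrete, here is a minimal, illustrative sketch in Python. It is a toy model of the bookkeeping only; the class and method names are hypothetical, not vLLM's actual implementation.

# Toy model of PagedAttention-style KV block bookkeeping (illustrative only)
BLOCK_SIZE = 16  # tokens per KV block

class BlockAllocator:
    """Hands out fixed-size physical blocks; no contiguity required."""
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))

    def allocate(self):
        return self.free_blocks.pop()          # O(1); any free block will do

    def release(self, block_id):
        self.free_blocks.append(block_id)      # instantly reusable by other sequences

class Sequence:
    """Maps a sequence's logical KV blocks to physical blocks via a block table."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []                  # logical index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:  # last block is full (or none yet)
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def finish(self):
        for b in self.block_table:             # free all blocks when the sequence ends
            self.allocator.release(b)
        self.block_table.clear()

# Usage: a sequence draws blocks on demand instead of reserving a max-length slot
pool = BlockAllocator(num_blocks=1024)
seq = Sequence(pool)
for _ in range(200):
    seq.append_token()
print(len(seq.block_table), "blocks for 200 tokens")  # 13 blocks, not a 2048-slot reservation
seq.finish()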



3. Benchmark: Hugging Face vs vLLM

We ran the same model (facebook/opt-125m) on the same GPU.

vLLM with PagedAttention

import os, time
# Use "spawn" to start vLLM worker processes (avoids fork-related issues)
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")      # PagedAttention-backed engine
params = SamplingParams(max_tokens=32)    # generate up to 32 new tokens
prompt = "Explain why KV cache becomes the bottleneck in LLM serving."

start = time.time()
outputs = llm.generate([prompt], params)
print("Latency:", time.time() - start)

Output: Latency: 0.2919800281524658 seconds

Hugging Face Baseline

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, time

model_name = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Standard Hugging Face generation path: contiguous, per-sequence KV cache
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda()

prompt = "Explain why KV cache becomes the bottleneck in LLM serving."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

start = time.time()
outputs = model.generate(**inputs, max_new_tokens=256)
print("Latency:", time.time() - start)

Output: Latency: 2.889211416244507 seconds

Result

  • Hugging Face (naïve KV management): ~2.89s
  • vLLM + PagedAttention: ~0.29s

That’s a 10× speedup even on a small model. On larger models and longer sequences, the gap widens.



4. Memory Usage Analysis

How memory gets wasted without PagedAttention:

  • Reserved slots for future tokens
  • Internal fragmentation (allocate 2048, use 200)
  • External fragmentation from buddy allocators

Studies show only 20–38% of KV memory actually stores token states.
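
A quick worked example of the internal-fragmentation point above (the 2048/200 numbers come from the list; the 16-token block size is discussed in section 6):

# Utilization with contiguous pre-allocation vs. 16-token blocks
import math

used = 200                 # tokens actually generated
reserved_contiguous = 2048 # slots reserved up front per request
block_size = 16
reserved_paged = math.ceil(used / block_size) * block_size  # 13 blocks = 208 slots

print(f"Contiguous: {used / reserved_contiguous:.0%} of reserved KV memory used")  # ~10%
print(f"Paged (16-token blocks): {used / reserved_paged:.0%} used")                # ~96%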

Quick GPU Memory Check

You can measure GPU memory with PyTorch:

print("Allocated:", torch.cuda.max_memory_allocated() / 1e9, "GB")
print("Reserved:", torch.cuda.max_memory_reserved() / 1e9, "GB")

Compare Hugging Face vs vLLM: you’ll see vLLM uses much less memory per token.



5. Deployment & Engineering Insights

PagedAttention doesn’t just make research benchmarks faster—it changes serving system design.

Throughput vs Latency

  • Latency improves modestly
  • Throughput improves dramatically, because far larger batch sizes fit in memory (see the sketch below)
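
A hedged sketch of how you might measure batched throughput, rather than single-prompt latency, with vLLM (the prompt list and token counts here are illustrative):

import time
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(max_tokens=64)
prompts = [f"Request {i}: explain KV cache paging." for i in range(64)]  # one large batch

start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} tokens in {elapsed:.2f}s -> {generated / elapsed:.0f} tokens/s")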

Beam Search & Sampling

*Shared prefixes cached once

*Copy-on-write for parallel decoding
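
A toy copy-on-write sketch over shared prefix blocks (hypothetical refcounting for illustration, not vLLM internals):

# Toy copy-on-write over shared KV blocks (illustrative only)
ref_count = {}                                  # physical block id -> number of users

def fork(parent_block_table):
    """A beam/sampling child starts by sharing all of its parent's blocks."""
    for b in parent_block_table:
        ref_count[b] = ref_count.get(b, 1) + 1
    return list(parent_block_table)

def write(block_table, idx, allocate, copy_block):
    """Copy a shared block the first time a sequence writes new KV entries into it."""
    b = block_table[idx]
    if ref_count.get(b, 1) > 1:                 # still shared -> copy on write
        new_b = allocate()
        copy_block(src=b, dst=new_b)            # duplicate existing KV entries
        ref_count[b] -= 1
        ref_count[new_b] = 1
        block_table[idx] = new_b
    return block_table[idx]

# Example: a child shares the parent's 4 prefix blocks until it writes into the last one
free = list(range(100, 110))
parent = [0, 1, 2, 3]
for b in parent:
    ref_count[b] = 1
child = fork(parent)
write(child, 3, allocate=free.pop, copy_block=lambda src, dst: None)  # dummy copy
print(parent, child)  # child now owns its own copy of the last block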

Production Integration

  • Works with the vLLM API server (python -m vllm.entrypoints.api_server)
  • Can be deployed standalone or behind Triton



6. Block Size Trade-Offs

Choosing the block size matters:

  • Too small: poor GPU utilization
  • Too large: fragmentation creeps back

The sweet spot: ≈16 tokens per block (see the sketch below).
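
In vLLM the block size is exposed as an engine argument; a brief sketch (to the best of my knowledge, 16 tokens per block is the default in recent vLLM versions):

from vllm import LLM

# Explicitly set the KV cache block size (tokens per block)
llm = LLM(model="facebook/opt-125m", block_size=16)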



7. Future Directions

PagedAttention is powerful, but it’s just the beginning.

  • PagedAttention + FlashAttention: memory- and compute-optimized
  • Multi-GPU sharding: distribute KV blocks across GPUs
  • CPU offloading: run bigger models on consumer GPUs
  • Scheduling policies: FIFO vs priority serving in production



Final Takeaway

The real bottleneck in LLM serving isn’t FLOPs, it’s KV cache fragmentation. PagedAttention fixes this with OS-style block management, unlocking:

  • Higher throughput
  • Larger batch sizes
  • Efficient memory sharing

With real benchmarks showing ~10× latency reduction, PagedAttention is setting the new standard for LLM serving.