Measuring LLM Inference Performance: Beyond Tokens per Second

2025-09-24

One of the toughest ML Engineer interview questions is:

"How do you measure LLM inference performance? What metrics matter most for production systems?"

Most candidates only say tokens per second (TPS).
That answer is incomplete.

This post explains the critical metrics you must know to measure and optimize LLM inference in real production systems.



1) Core User Experience Metrics

Time to First Token (TTFT)

How long the user waits before seeing the first token.

  • Low TTFT = snappy user experience
  • High TTFT = perceived slowness

Targets:

  • GPT-4o: ~200ms
  • Gemini: ~300ms
  • Typical prototypes: often 2s or more


Time Per Output Token (TPOT)

The average time between consecutive output tokens once streaming has started, i.e. how smooth the output feels after the first token. Its inverse is the streaming rate:

  • ~4 tokens/second (TPOT ≈ 250ms) roughly matches human reading speed
  • < 2 tokens/second feels sluggish
  • > 8 tokens/second is usually overkill for a human reader


Token Generation Time

Time between the first and last token.
Critical for long-form answers.

Formula:
Token Generation Time = (number_of_output_tokens - 1) * TPOT



End-to-End Latency (E2E)

The complete user-perceived latency.

Formula:
E2E Latency = TTFT + Token Generation Time

Even if TTFT is fast, slow generation can ruin the experience.
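
A quick worked example pulling these formulas together (all numbers are illustrative, not measurements):

# Illustrative numbers only
ttft = 0.25                # seconds until the first token arrives
tpot = 0.05                # seconds per output token after the first (20 tok/s)
output_tokens = 400

token_generation_time = (output_tokens - 1) * tpot   # 19.95 s
e2e_latency = ttft + token_generation_time           # 20.20 s

print(f"Token Generation Time: {token_generation_time:.2f}s")
print(f"E2E Latency:           {e2e_latency:.2f}s")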



2) Prefill vs Decode

LLM inference has two distinct stages:

  • Prefill: processing all input tokens before the first output token is produced; its per-token cost is the time per input token (TPIT). This stage dominates TTFT for large prompts.
  • Decode: generating output tokens one at a time; its per-token cost is TPOT.

You should measure prefill throughput and decode throughput separately to diagnose latency issues.
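
A minimal sketch of splitting the two stages from streaming timestamps (it assumes you already know the prompt's token count, e.g. from a tokenizer, and treats TTFT as pure prefill time):

def stage_throughputs(ttft, e2e, input_tokens, output_tokens):
    """Approximate prefill and decode throughput for one streamed request."""
    prefill_tps = input_tokens / ttft                  # input tokens/s
    decode_time = max(e2e - ttft, 1e-9)
    decode_tps = (output_tokens - 1) / decode_time     # output tokens/s
    return prefill_tps, decode_tps

# Example: 2,000-token prompt, 400-token answer
print(stage_throughputs(ttft=0.8, e2e=9.0, input_tokens=2000, output_tokens=400))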



3) Latency Percentiles: P50 vs P99

Latency is not uniform. Always look at percentiles:

  • P50 (median): what half of users experience
  • P99: what the worst 1% experience

Example:

  • P50 TTFT = 120ms → feels great
  • P99 TTFT = 8s → unacceptable for production

Production systems must control tail latency, not just averages.
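
A small sketch of computing these from raw per-request measurements (the sample data is synthetic; real dashboards would pull from your metrics store):

import random
import statistics

random.seed(0)
# Synthetic data: most requests are fast, a small tail is very slow
ttft_ms = [random.gauss(120, 20) for _ in range(980)] + \
          [random.uniform(2000, 8000) for _ in range(20)]

cuts = statistics.quantiles(ttft_ms, n=100)    # 99 percentile cut points
p50, p95, p99 = statistics.median(ttft_ms), cuts[94], cuts[98]

# The mean hides the tail that P99 exposes
print(f"mean={statistics.mean(ttft_ms):.0f}ms  "
      f"P50={p50:.0f}ms  P95={p95:.0f}ms  P99={p99:.0f}ms")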



4) Requests vs Tokens

  • RPS (Requests Per Second): how many user requests you serve
  • TPS (Tokens Per Second): how many tokens you generate

Interpretation:

  • High RPS, low TPS = many short requests (e.g. chat turns)
  • Low RPS, high TPS = fewer but longer requests (e.g. long-document processing)

Both matter depending on workload.
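
Both come from the same serving log; a minimal sketch (the record shape is an assumption):

# Each record: output tokens for one request completed in a 60s window
completed = [350] * 120                      # 120 requests, 350 output tokens each
window_s = 60

rps = len(completed) / window_s              # 2.0 requests/s
output_tps = sum(completed) / window_s       # 700 output tokens/s
print(f"RPS={rps:.1f}  output TPS={output_tps:.0f}")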



5) Goodput

Throughput is meaningless if requests fail.

Formula:
Goodput = Successful tokens per second

Example:

  • Raw throughput: 1000 TPS
  • 20% of requests time out, so their tokens don't count
  • Goodput ≈ 800 TPS

Production systems care about goodput, not raw throughput.
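
A sketch of the same calculation from a request log, counting only tokens from requests that completed successfully (the record shape is an assumption):

# (output_tokens, succeeded) per request over a measurement window
records = [(500, True)] * 80 + [(500, False)] * 20    # 20% time out
window_s = 50.0

raw_tps = sum(tok for tok, _ in records) / window_s           # 1000 TPS
goodput = sum(tok for tok, ok in records if ok) / window_s    # 800 TPS
print(f"raw={raw_tps:.0f} TPS  goodput={goodput:.0f} TPS")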



6) RAG Pipeline Timing

Most LLM apps use retrieval-augmented generation (RAG). You must budget time across the full pipeline:

  • Retrieval latency (vector search, cache hit rate)
  • Reranker latency
  • Prompt assembly
  • LLM inference (TTFT, TPIT, TPOT, E2E)
  • Post-processing (validation, guardrails)
  • Network overhead

E2E measurement should always cover the entire pipeline, not just the model.
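
A minimal pattern for per-stage timing with time.perf_counter (the stage names and bodies below are placeholders for your own retrieval, reranking, and model calls):

import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

with timed("retrieval"):
    docs = ["chunk-1", "chunk-2"]           # vector search would go here
with timed("rerank"):
    docs = sorted(docs)                     # reranker would go here
with timed("prompt_assembly"):
    prompt = "\n".join(docs)
with timed("llm_inference"):
    answer = f"answer based on {len(docs)} chunks"   # model call would go here

print({k: f"{v * 1000:.2f}ms" for k, v in timings.items()})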



7) Operational Metrics (SLI and SLO)

Define metrics and targets clearly:

  • SLI (Service Level Indicator): measured values like P99 TTFT, goodput, timeout rate
  • SLO (Service Level Objective): targets like "P99 TTFT < 250ms for 30 days with 99.5% success rate"

Other useful metrics:

  • Timeout rate
  • Retry rate
  • Guardrail block rate
  • Cache hit ratio

Monitoring dashboards should show P50/P95/P99 TTFT, goodput vs load, and batch size vs latency curves.
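
A sketch of checking an SLO like the one above against collected SLIs (the thresholds are the example targets from this section, and the sample data is synthetic):

import statistics

def slo_report(ttft_ms_samples, success_flags,
               p99_target_ms=250.0, success_target=0.995):
    """Evaluate 'P99 TTFT < 250ms with 99.5% success rate' over a window."""
    p99 = statistics.quantiles(ttft_ms_samples, n=100)[98]
    success_rate = sum(success_flags) / len(success_flags)
    return {
        "p99_ttft_ms": round(p99, 1),
        "success_rate": success_rate,
        "slo_met": p99 < p99_target_ms and success_rate >= success_target,
    }

print(slo_report([120] * 990 + [300] * 10, [True] * 996 + [False] * 4))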



8) Optimization Playbook

Goal | Lever | Effect | Tradeoff
Lower TTFT | Smaller batching window | Faster first token | Lower throughput
Lower E2E | Speculative decoding | Faster generation | Overhead if drafts are discarded
Higher prefill throughput (lower TPIT) | Prompt/prefix caching | Faster prefill | Cache invalidation complexity
Higher decode throughput (lower TPOT) | FlashAttention / PagedAttention | Faster decoding | Memory and hardware limits
Higher goodput | Admission control | Fewer timeouts | Some requests rejected
Lower cost | Quantization (INT8/FP8) | Cheaper per token | Possible quality drop
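
As one concrete illustration of the goodput lever, a minimal admission-control sketch (the queue-depth threshold is a placeholder; a real serving stack would tune it against measured queue delay and P99 TTFT):

import collections

MAX_QUEUE_DEPTH = 64          # placeholder; tune against your latency SLO
queue = collections.deque()

def admit(request_id):
    """Reject early instead of letting the request time out in the queue."""
    if len(queue) >= MAX_QUEUE_DEPTH:
        return False          # counts against RPS, protects goodput
    queue.append(request_id)
    return True

print([admit(i) for i in range(70)].count(False))   # 6 requests rejected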


9) Benchmarking Pitfalls & Checklist

Common mistakes:

  • Confusing stream chunks with tokens
  • Reporting only averages instead of P95/P99
  • Hiding cold-start penalties
  • Not documenting batch policy (window size, max batch)

Checklist for fair benchmarks:

  • Model version, quantization
  • Serving framework (vLLM, TGI, Triton)
  • Hardware (GPU type, memory, interconnect)
  • Workload distribution (chat/doc/tool-call)
  • Concurrency level
  • Percentiles (P50/P95/P99)
  • Cache state (warm vs cold)
  • Grammar/JSON enforcement enabled?
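
One way to keep benchmarks honest is to record the checklist above as metadata next to every result; a minimal sketch (field values are placeholders):

from dataclasses import dataclass, asdict
import json

@dataclass
class BenchmarkConfig:
    model: str
    quantization: str
    framework: str           # e.g. vLLM, TGI, Triton
    hardware: str
    workload: str            # chat / doc / tool-call mix
    concurrency: int
    cache_state: str         # warm or cold
    json_enforcement: bool

config = BenchmarkConfig(
    model="example-model-v1", quantization="fp8", framework="vLLM",
    hardware="1x 80GB GPU", workload="chat", concurrency=64,
    cache_state="warm", json_enforcement=False,
)
print(json.dumps(asdict(config), indent=2))   # store next to the latency numbers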


10) The Interview Trap

Suppose you make a vague claim like:
"Our system handles 1000 TPS."

The interviewer will ask:

  • Input or output TPS?
  • What batch size?
  • Mean or P99 latency?
  • How many concurrent users?

Example Strong Answers

Q: "Why is P99 TTFT high?"
A: "Batch window was 30ms, which caused long queueing. Reducing it to 15ms lowered P99 TTFT from 910ms to 420ms. Increasing prefix cache hit rate from 35% to 60% also helped."

Q: "You said 1000 TPS. Input or output?"
A: "Output TPS. Average 400 output tokens per request. Batch size 16, dynamic window 12ms. P95 E2E = 2.1s, P99 = 3.7s. Goodput = 99.2%."

Q: "What about speculative decoding?"
A: "It increased generation speed by 35%, but JSON-enforced requests had draft cancellation overhead, so we disable it for strict JSON workloads."



11) Example Benchmarking Code (Python)

import time
from openai import OpenAI

client = OpenAI()
prompt = "Write a short story about a bee that becomes a data scientist."

start_time = time.time()
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    stream=True
)

first_token_time = None
last_token_time = None
token_count = 0

for chunk in stream:
    # Each streamed chunk carries a small text delta. Chunks roughly
    # approximate tokens but are not guaranteed to map 1:1 (see pitfalls above).
    delta = chunk.choices[0].delta.content if chunk.choices else None
    if delta:
        now = time.time()
        if first_token_time is None:
            first_token_time = now
        last_token_time = now
        token_count += 1

ttft = first_token_time - start_time
e2el = last_token_time - start_time
tpot = (last_token_time - first_token_time) / max(token_count - 1, 1)

print(f"TTFT: {ttft:.3f}s")
print(f"E2E: {e2el:.3f}s")
print(f"TPOT: {tpot:.3f}s/token (~{1/tpot:.1f} tok/s)")
print(f"Total tokens: {token_count}")

12) Use Case Targets

Chatbots: TTFT < 200ms, E2E < 2s

Code completion: TTFT < 100ms

Document processing: optimize throughput, not latency

Realtime copilots: TTFT < 150ms, streaming rate >= 3 tok/s (TPOT <= ~330ms)



Final Takeaways

TPS is not enough.

You must track TTFT, TPIT, TPOT, E2E latency, Goodput, and percentiles.

Benchmarks must be transparent, reproducible, and workload-aware.

Production systems need operational SLOs, monitoring, and tradeoff decisions.

In interviews, always give precise, context-aware answers with batch size, percentiles, and concurrency.

This is how production-scale LLM systems are designed :)