Measuring LLM Inference Performance: Beyond Tokens per Second
One of the toughest ML Engineer interview questions is:
"How do you measure LLM inference performance? What metrics matter most for production systems?"
Most candidates only say tokens per second (TPS).
That answer is incomplete.
This post explains the critical metrics you must know to measure and optimize LLM inference in real production systems.
1) Core User Experience Metrics
Time to First Token (TTFT)
How long the user waits before seeing the first token.
- Low TTFT = snappy user experience
- High TTFT = perceived slowness
 
Targets:
- GPT-4o: ~200ms
- Gemini: ~300ms
- Typical prototypes: often 2s or more
 
Time Per Output Token (TPOT)
The average time between consecutive tokens; its inverse is the streaming rate, which determines how smooth streaming feels after the first token.
- ~4 tokens/second = roughly matches human reading speed
- < 2 tokens/second = sluggish
- > 8 tokens/second = usually overkill
 
Token Generation Time
Time between the first and last token.
Critical for long-form answers.
Formula:
Token Generation Time = number_of_tokens * TPOT
End-to-End Latency (E2E)
The complete user-perceived latency.
Formula:
E2E Latency = TTFT + Token Generation Time
Even if TTFT is fast, slow generation can ruin the experience.
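A minimal sketch of how these metrics combine, using the formulas above with illustrative numbers rather than real measurements:

```python
# Illustrative numbers only; plug in your own measurements.
ttft = 0.25              # Time to First Token, seconds
tpot = 0.05              # Time Per Output Token, seconds (20 tok/s streaming rate)
num_output_tokens = 400  # length of the generated answer

token_generation_time = num_output_tokens * tpot  # time spent streaming the answer
e2e_latency = ttft + token_generation_time        # what the user actually waits

print(f"Token generation time: {token_generation_time:.2f}s")  # 20.00s
print(f"E2E latency: {e2e_latency:.2f}s")                      # 20.25s
```

Even with a 250ms TTFT, a 400-token answer at 20 tok/s keeps the user waiting over 20 seconds.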
2) Prefill vs Decode
LLM inference has two distinct stages:
- Prefill (TPIT): processing input tokens. A bottleneck for large prompts.
- Decode (TPOT): generating output tokens.
 
You should measure prefill throughput and decode throughput separately to diagnose latency issues.
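A rough sketch of how to split one streamed request into the two throughputs; it assumes TTFT and E2E were measured client-side (as in the code in section 11), so TTFT also includes queueing and network time and the prefill figure is an approximation:

```python
def prefill_decode_throughput(ttft_s, e2e_s, input_tokens, output_tokens):
    # Prefill throughput: prompt tokens processed before the first output token.
    prefill_tps = input_tokens / ttft_s
    # Decode throughput: output tokens generated after the first one.
    decode_tps = output_tokens / (e2e_s - ttft_s)
    return prefill_tps, decode_tps

# Illustrative numbers: 3000-token prompt, 400-token answer.
print(prefill_decode_throughput(ttft_s=0.4, e2e_s=2.4,
                                input_tokens=3000, output_tokens=400))
# -> (7500.0, 200.0) tokens/second
```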
3) Latency Percentiles: P50 vs P99
Latency is not uniform. Always look at percentiles:
- P50 (median): what half of users experience
- P99: what the worst 1% experience
 
Example:
- P50 TTFT = 120ms → feels great
- P99 TTFT = 8s → unacceptable for production
 
Production systems must control tail latency, not just averages.
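A short sketch of why averages mislead, using made-up TTFT samples:

```python
import numpy as np

# Nine fast requests and one slow cold start, in seconds (toy data).
ttft_samples = np.array([0.11, 0.12, 0.13, 0.12, 0.14, 0.12, 0.11, 0.13, 0.12, 4.8])

p50, p95, p99 = np.percentile(ttft_samples, [50, 95, 99])
print(f"mean={ttft_samples.mean()*1000:.0f}ms")  # ~590ms, looks tolerable
print(f"P50={p50*1000:.0f}ms  P95={p95*1000:.0f}ms  P99={p99*1000:.0f}ms")  # the tail tells the truth
```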
4) Requests vs Tokens
- RPS (Requests Per Second): how many user requests you serve
- TPS (Tokens Per Second): how many tokens you generate
 
Interpretation:
- High RPS, low TPS = many short requests
- Low RPS, high TPS = fewer but longer requests (e.g. long documents)
 
Both matter depending on workload.
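A tiny sketch showing how both rates come out of the same load-test window (illustrative numbers):

```python
window_seconds = 60
completed_requests = 900
output_tokens = 360_000

rps = completed_requests / window_seconds  # 15 requests/s
tps = output_tokens / window_seconds       # 6000 output tokens/s
print(f"RPS={rps:.0f}, TPS={tps:.0f}, avg {tps/rps:.0f} tokens/request")
```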
5) Goodput
Throughput is meaningless if requests fail.
Formula:
Goodput = Successful tokens per second
Example:
- Raw throughput: 1000 TPS
- 20% of requests time out
- Goodput = 800 TPS
 
Production systems care about goodput, not raw throughput.
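A minimal sketch of goodput accounting; the per-request records are assumed to come from your own load generator:

```python
def goodput_tps(results, window_seconds):
    """Tokens/second counting only requests that completed successfully."""
    good_tokens = sum(r["output_tokens"] for r in results if r["ok"])
    return good_tokens / window_seconds

# Toy example mirroring the numbers above: 20% of requests fail.
results = [{"ok": i % 5 != 0, "output_tokens": 400} for i in range(150)]
print(goodput_tps(results, window_seconds=60))  # 800.0, vs 1000.0 raw TPS
```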
6) RAG Pipeline Timing
Most LLM apps use retrieval-augmented generation (RAG). You must budget time across the full pipeline:
- Retrieval latency (vector search, cache hit rate)
- Reranker latency
- Prompt assembly
- LLM inference (TTFT, TPIT, TPOT, E2E)
- Post-processing (validation, guardrails)
- Network overhead
 
E2E measurement should always cover the entire pipeline, not just the model.
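One way to collect those per-stage numbers is a small timing helper wrapped around each step; the stage functions below are stubs standing in for a real retriever, reranker, and model call:

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

# Stubs for illustration only; replace with your real pipeline stages.
def retrieve(q):           time.sleep(0.03); return ["doc"]
def rerank(q, docs):       time.sleep(0.02); return docs
def build_prompt(q, docs): return q + "\n" + "\n".join(docs)
def generate(p):           time.sleep(0.50); return "answer"  # measure TTFT/TPOT inside here too

query = "example question"
with timed("retrieval"):       docs = retrieve(query)
with timed("rerank"):          docs = rerank(query, docs)
with timed("prompt_assembly"): prompt = build_prompt(query, docs)
with timed("llm_inference"):   answer = generate(prompt)

print(timings)  # the E2E budget is the sum of these plus post-processing and network overhead
```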
7) Operational Metrics (SLI and SLO)
Define metrics and targets clearly:
- SLI (Service Level Indicator): measured values like P99 TTFT, goodput, timeout rate
- SLO (Service Level Objective): targets like "P99 TTFT < 250ms for 30 days with 99.5% success rate"
 
Other useful metrics:
- Timeout rate
- Retry rate
- Guardrail block rate
- Cache hit ratio
 
Monitoring dashboards should show P50/P95/P99 TTFT, goodput vs load, and batch size vs latency curves.
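A sketch of an SLO check over recorded SLIs; both arrays are synthetic stand-ins for data you would pull from your monitoring system:

```python
import numpy as np

rng = np.random.default_rng(0)
ttft_ms = rng.lognormal(mean=4.8, sigma=0.3, size=10_000)  # synthetic TTFT samples, ms
succeeded = rng.random(10_000) > 0.003                     # synthetic success/failure flags

# SLIs: the measured values.
sli_p99_ttft = np.percentile(ttft_ms, 99)
sli_success_rate = succeeded.mean()

# SLO: the target those SLIs are held to.
slo_met = sli_p99_ttft < 250 and sli_success_rate >= 0.995
print(f"P99 TTFT={sli_p99_ttft:.0f}ms, success={sli_success_rate:.2%}, SLO met: {slo_met}")
```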
8) Optimization Playbook
| Goal | Lever | Effect | Tradeoff | 
|---|---|---|---|
| Lower TTFT | Smaller batching window | Faster first token | Lower throughput | 
| Lower E2E | Speculative decoding | Faster generation | Overhead if drafts discarded | 
| Lower TPIT | Prompt/prefix caching | Faster prefill | Cache invalidation complexity |
| Lower TPOT | Flash/Paged Attention | Faster decoding | Memory and hardware limits |
| Higher Goodput | Admission control | Fewer timeouts | Some requests rejected | 
| Lower cost | Quantization (INT8/FP8) | Cheaper per token | Possible quality drop | 
9) Benchmarking Pitfalls & Checklist
Common mistakes:
- Confusing stream chunks with tokens
- Reporting only averages instead of P95/P99
- Hiding cold-start penalties
- Not documenting batch policy (window size, max batch)
 
Checklist for fair benchmarks (see the sketch after this list):
- Model version, quantization
- Serving framework (vLLM, TGI, Triton)
- Hardware (GPU type, memory, interconnect)
- Workload distribution (chat/doc/tool-call)
- Concurrency level
- Percentiles (P50/P95/P99)
- Cache state (warm vs cold)
- Grammar/JSON enforcement enabled?
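One lightweight way to enforce the checklist is to refuse to report numbers without this metadata attached; the dataclass below is a hypothetical example, not a standard:

```python
from dataclasses import dataclass

@dataclass
class BenchmarkConfig:
    model: str              # model version + quantization, e.g. "gpt-4o-mini, fp16"
    serving_framework: str  # e.g. "vLLM", "TGI", "Triton"
    hardware: str           # GPU type, memory, interconnect
    workload: str           # chat / doc / tool-call mix
    concurrency: int        # simultaneous requests
    percentiles: tuple      # which percentiles are reported
    cache_state: str        # "warm" or "cold"
    json_enforcement: bool  # grammar/JSON-constrained decoding on?

config = BenchmarkConfig(
    model="gpt-4o-mini, fp16", serving_framework="vLLM",
    hardware="1x A100 80GB", workload="chat-heavy", concurrency=32,
    percentiles=(50, 95, 99), cache_state="warm", json_enforcement=False,
)
print(config)  # attach this alongside every reported latency/throughput number
```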
 
10) The Interview Trap
The trap starts with a vague claim like:
"Our system handles 1000 TPS."
The interviewer will ask:
- Input or output TPS?
- What batch size?
- Mean or P99 latency?
- How many concurrent users?
 
Example Strong Answers
Q: "Why is P99 TTFT high?"
A: "Batch window was 30ms, which caused long queueing. Reducing it to 15ms lowered P99 TTFT from 910ms to 420ms. Increasing prefix cache hit rate from 35% to 60% also helped."
Q: "You said 1000 TPS. Input or output?"
A: "Output TPS. Average 400 output tokens per request. Batch size 16, dynamic window 12ms. P95 E2E = 2.1s, P99 = 3.7s. Goodput = 99.2%."
Q: "What about speculative decoding?"
A: "It increased generation speed by 35%, but JSON-enforced requests had draft cancellation overhead, so we disable it for strict JSON workloads."
11) Example Benchmarking Code (Python)
```python
import time
from openai import OpenAI

client = OpenAI()

prompt = "Write a short story about a bee that becomes a data scientist."

start_time = time.time()
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    stream=True,
)

first_token_time = None
last_token_time = None
token_count = 0

for chunk in stream:
    # Each streamed chunk carries a content delta; a delta may contain more than
    # one token, so this count is an approximation (see the pitfall in section 9).
    if chunk.choices and chunk.choices[0].delta.content:
        now = time.time()
        if first_token_time is None:
            first_token_time = now
        last_token_time = now
        token_count += 1

ttft = first_token_time - start_time
e2el = last_token_time - start_time
tpot = (last_token_time - first_token_time) / token_count

print(f"TTFT: {ttft:.3f}s")
print(f"E2E: {e2el:.3f}s")
print(f"TPOT: {tpot:.3f}s/token (~{1/tpot:.1f} tok/s)")
print(f"Streamed chunks (~tokens): {token_count}")
```
12) Use Case Targets
- Chatbots: TTFT < 200ms, E2E < 2s
- Code completion: TTFT < 100ms
- Document processing: optimize throughput, not latency
- Realtime copilots: TTFT < 150ms, streaming rate >= 3 tok/s
Final Takeaways
TPS is not enough.
You must track TTFT, TPIT, TPOT, E2E latency, Goodput, and percentiles.
Benchmarks must be transparent, reproducible, and workload-aware.
Production systems need operational SLOs, monitoring, and tradeoff decisions.
In interviews, always give precise, context-aware answers with batch size, percentiles, and concurrency.
This is how production-scale LLM systems are designed :)