Measuring LLM Inference Performance: Beyond Tokens per Second

2025-09-24

One of the toughest ML Engineer interview questions is:

"How do you measure LLM inference performance? What metrics matter most for production systems?"

Most candidates only say tokens per second (TPS).
That answer is incomplete.

This post explains the critical metrics you must know to measure and optimize LLM inference in real production systems.



1) Core User Experience Metrics

Time to First Token (TTFT)

How long the user waits before seeing the first token.

  • Low TTFT = snappy user experience
  • High TTFT = perceived slowness

Targets:

  • GPT-4o: ~200ms
  • Gemini: ~300ms
  • Typical prototypes: often 2s or more


Time Per Output Token (TPOT)

The average time between consecutive output tokens once streaming has started, i.e. how smooth the output feels after the first token. Its inverse is the streaming rate:

  • ~4 tokens/second (TPOT ≈ 250ms) roughly matches human reading speed
  • < 2 tokens/second feels sluggish
  • > 8 tokens/second is usually overkill for a human reader


Token Generation Time

Time between the first and last token.
Critical for long-form answers.

Formula:
Token Generation Time = (number_of_output_tokens - 1) * TPOT



End-to-End Latency (E2E)

The complete user-perceived latency.

Formula:
E2E Latency = TTFT + Token Generation Time

Even if TTFT is fast, slow generation can ruin the experience.
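
A quick worked example pulling these formulas together (all numbers are illustrative, not measurements):

# Illustrative numbers only
ttft = 0.25                # seconds until the first token arrives
tpot = 0.05                # seconds per output token after the first (20 tok/s)
output_tokens = 400

token_generation_time = (output_tokens - 1) * tpot   # 19.95 s
e2e_latency = ttft + token_generation_time           # 20.20 s

print(f"Token Generation Time: {token_generation_time:.2f}s")
print(f"E2E Latency:           {e2e_latency:.2f}s")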



2) Prefill vs Decode

LLM inference has two distinct stages:

  • Prefill: processing all input tokens before the first output token is produced; its per-token cost is the time per input token (TPIT). This stage dominates TTFT for large prompts.
  • Decode: generating output tokens one at a time; its per-token cost is TPOT.

You should measure prefill throughput and decode throughput separately to diagnose latency issues.
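
A minimal sketch of splitting the two stages from streaming timestamps (it assumes you already know the prompt's token count, e.g. from a tokenizer, and treats TTFT as pure prefill time):

def stage_throughputs(ttft, e2e, input_tokens, output_tokens):
    """Approximate prefill and decode throughput for one streamed request."""
    prefill_tps = input_tokens / ttft                  # input tokens/s
    decode_time = max(e2e - ttft, 1e-9)
    decode_tps = (output_tokens - 1) / decode_time     # output tokens/s
    return prefill_tps, decode_tps

# Example: 2,000-token prompt, 400-token answer
print(stage_throughputs(ttft=0.8, e2e=9.0, input_tokens=2000, output_tokens=400))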



3) Latency Percentiles: P50 vs P99

Latency is not uniform. Always look at percentiles:

  • P50 (median): what half of users experience
  • P99: what the worst 1% experience

Example:

  • P50 TTFT = 120ms → feels great
  • P99 TTFT = 8s → unacceptable for production

Production systems must control tail latency, not just averages.
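
A small sketch of computing these from raw per-request measurements (the sample data is synthetic; real dashboards would pull from your metrics store):

import random
import statistics

random.seed(0)
# Synthetic data: most requests are fast, a small tail is very slow
ttft_ms = [random.gauss(120, 20) for _ in range(980)] + \
          [random.uniform(2000, 8000) for _ in range(20)]

cuts = statistics.quantiles(ttft_ms, n=100)    # 99 percentile cut points
p50, p95, p99 = statistics.median(ttft_ms), cuts[94], cuts[98]

# The mean hides the tail that P99 exposes
print(f"mean={statistics.mean(ttft_ms):.0f}ms  "
      f"P50={p50:.0f}ms  P95={p95:.0f}ms  P99={p99:.0f}ms")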



4) Requests vs Tokens

  • RPS (Requests Per Second): how many user requests you serve
  • TPS (Tokens Per Second): how many tokens you generate

Interpretation:

  • High RPS, low TPS = many short requests (e.g. chat turns)
  • Low RPS, high TPS = fewer but longer requests (e.g. long-document processing)

Both matter depending on workload.
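
Both come from the same serving log; a minimal sketch (the record shape is an assumption):

# Each record: output tokens for one request completed in a 60s window
completed = [350] * 120                      # 120 requests, 350 output tokens each
window_s = 60

rps = len(completed) / window_s              # 2.0 requests/s
output_tps = sum(completed) / window_s       # 700 output tokens/s
print(f"RPS={rps:.1f}  output TPS={output_tps:.0f}")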



5) Goodput

Throughput is meaningless if requests fail.

Formula:
Goodput = Successful tokens per second

Example:

  • Raw throughput: 1000 TPS
  • 20% of requests time out, so their tokens don't count
  • Goodput ≈ 800 TPS

Production systems care about goodput, not raw throughput.
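
A sketch of the same calculation from a request log, counting only tokens from requests that completed successfully (the record shape is an assumption):

# (output_tokens, succeeded) per request over a measurement window
records = [(500, True)] * 80 + [(500, False)] * 20    # 20% time out
window_s = 50.0

raw_tps = sum(tok for tok, _ in records) / window_s           # 1000 TPS
goodput = sum(tok for tok, ok in records if ok) / window_s    # 800 TPS
print(f"raw={raw_tps:.0f} TPS  goodput={goodput:.0f} TPS")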



6) RAG Pipeline Timing

Most LLM apps use retrieval-augmented generation (RAG). You must budget time across the full pipeline:

  • Retrieval latency (vector search, cache hit rate)
  • Reranker latency
  • Prompt assembly
  • LLM inference (TTFT, TPIT, TPOT, E2E)
  • Post-processing (validation, guardrails)
  • Network overhead

E2E measurement should always cover the entire pipeline, not just the model.
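
A minimal pattern for per-stage timing with time.perf_counter (the stage names and bodies below are placeholders for your own retrieval, reranking, and model calls):

import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

with timed("retrieval"):
    docs = ["chunk-1", "chunk-2"]           # vector search would go here
with timed("rerank"):
    docs = sorted(docs)                     # reranker would go here
with timed("prompt_assembly"):
    prompt = "\n".join(docs)
with timed("llm_inference"):
    answer = f"answer based on {len(docs)} chunks"   # model call would go here

print({k: f"{v * 1000:.2f}ms" for k, v in timings.items()})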



7) Operational Metrics (SLI and SLO)

Define metrics and targets clearly:

  • SLI (Service Level Indicator): measured values like P99 TTFT, goodput, timeout rate
  • SLO (Service Level Objective): targets like "P99 TTFT < 250ms for 30 days with 99.5% success rate"

Other useful metrics:

  • Timeout rate
  • Retry rate
  • Guardrail block rate
  • Cache hit ratio

Monitoring dashboards should show P50/P95/P99 TTFT, goodput vs load, and batch size vs latency curves.
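
A sketch of checking an SLO like the one above against collected SLIs (the thresholds are the example targets from this section, and the sample data is synthetic):

import statistics

def slo_report(ttft_ms_samples, success_flags,
               p99_target_ms=250.0, success_target=0.995):
    """Evaluate 'P99 TTFT < 250ms with 99.5% success rate' over a window."""
    p99 = statistics.quantiles(ttft_ms_samples, n=100)[98]
    success_rate = sum(success_flags) / len(success_flags)
    return {
        "p99_ttft_ms": round(p99, 1),
        "success_rate": success_rate,
        "slo_met": p99 < p99_target_ms and success_rate >= success_target,
    }

print(slo_report([120] * 990 + [300] * 10, [True] * 996 + [False] * 4))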



8) Optimization Playbook

Goal | Lever | Effect | Tradeoff
Lower TTFT | Smaller batching window | Faster first token | Lower throughput
Lower E2E | Speculative decoding | Faster generation | Overhead if drafts are discarded
Higher prefill throughput (lower TPIT) | Prompt/prefix caching | Faster prefill | Cache invalidation complexity
Higher decode throughput (lower TPOT) | FlashAttention / PagedAttention | Faster decoding | Memory and hardware limits
Higher goodput | Admission control | Fewer timeouts | Some requests rejected
Lower cost | Quantization (INT8/FP8) | Cheaper per token | Possible quality drop
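
As one concrete illustration of the goodput lever, a minimal admission-control sketch (the queue-depth threshold is a placeholder; a real serving stack would tune it against measured queue delay and P99 TTFT):

import collections

MAX_QUEUE_DEPTH = 64          # placeholder; tune against your latency SLO
queue = collections.deque()

def admit(request_id):
    """Reject early instead of letting the request time out in the queue."""
    if len(queue) >= MAX_QUEUE_DEPTH:
        return False          # counts against RPS, protects goodput
    queue.append(request_id)
    return True

print([admit(i) for i in range(70)].count(False))   # 6 requests rejected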


9) Benchmarking Pitfalls & Checklist

Common mistakes:

  • Confusing stream chunks with tokens
  • Reporting only averages instead of P95/P99
  • Hiding cold-start penalties
  • Not documenting batch policy (window size, max batch)

Checklist for fair benchmarks:

  • Model version, quantization
  • Serving framework (vLLM, TGI, Triton)
  • Hardware (GPU type, memory, interconnect)
  • Workload distribution (chat/doc/tool-call)
  • Concurrency level
  • Percentiles (P50/P95/P99)
  • Cache state (warm vs cold)
  • Grammar/JSON enforcement enabled?
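
One way to keep benchmarks honest is to record the checklist above as metadata next to every result; a minimal sketch (field values are placeholders):

from dataclasses import dataclass, asdict
import json

@dataclass
class BenchmarkConfig:
    model: str
    quantization: str
    framework: str           # e.g. vLLM, TGI, Triton
    hardware: str
    workload: str            # chat / doc / tool-call mix
    concurrency: int
    cache_state: str         # warm or cold
    json_enforcement: bool

config = BenchmarkConfig(
    model="example-model-v1", quantization="fp8", framework="vLLM",
    hardware="1x 80GB GPU", workload="chat", concurrency=64,
    cache_state="warm", json_enforcement=False,
)
print(json.dumps(asdict(config), indent=2))   # store next to the latency numbers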


10) The Interview Trap

Suppose you make a vague claim like:
"Our system handles 1000 TPS."

The interviewer will ask:

  • Input or output TPS?
  • What batch size?
  • Mean or P99 latency?
  • How many concurrent users?

Example Strong Answers

Q: "Why is P99 TTFT high?"
A: "Batch window was 30ms, which caused long queueing. Reducing it to 15ms lowered P99 TTFT from 910ms to 420ms. Increasing prefix cache hit rate from 35% to 60% also helped."

Q: "You said 1000 TPS. Input or output?"
A: "Output TPS. Average 400 output tokens per request. Batch size 16, dynamic window 12ms. P95 E2E = 2.1s, P99 = 3.7s. Goodput = 99.2%."

Q: "What about speculative decoding?"
A: "It increased generation speed by 35%, but JSON-enforced requests had draft cancellation overhead, so we disable it for strict JSON workloads."



11) Example Benchmarking Code (Python)

import time
from openai import OpenAI

client = OpenAI()
prompt = "Write a short story about a bee that becomes a data scientist."

start_time = time.time()
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    stream=True
)

first_token_time = None
last_token_time = None
token_count = 0

for chunk in stream:
    # Each streamed chunk carries a small text delta. Chunks roughly
    # approximate tokens but are not guaranteed to map 1:1 (see pitfalls above).
    delta = chunk.choices[0].delta.content if chunk.choices else None
    if delta:
        now = time.time()
        if first_token_time is None:
            first_token_time = now
        last_token_time = now
        token_count += 1

ttft = first_token_time - start_time
e2el = last_token_time - start_time
tpot = (last_token_time - first_token_time) / max(token_count - 1, 1)

print(f"TTFT: {ttft:.3f}s")
print(f"E2E: {e2el:.3f}s")
print(f"TPOT: {tpot:.3f}s/token (~{1/tpot:.1f} tok/s)")
print(f"Total tokens: {token_count}")

12) Use Case Targets

Chatbots: TTFT < 200ms, E2E < 2s

Code completion: TTFT < 100ms

Document processing: optimize throughput, not latency

Realtime copilots: TTFT < 150ms, streaming rate >= 3 tok/s (TPOT <= ~330ms)



Final Takeaways

TPS is not enough.

You must track TTFT, TPIT, TPOT, E2E latency, Goodput, and percentiles.

Benchmarks must be transparent, reproducible, and workload-aware.

Production systems need operational SLOs, monitoring, and tradeoff decisions.

In interviews, always give precise, context-aware answers with batch size, percentiles, and concurrency.

This is how production-scale LLM systems are designed :)