Training a powerful model is only half the story. The real engineering challenge is serving it at scale — without burning your GPU budget or testing your users’ patience.

There is a peculiar irony at the heart of modern AI deployment. We spend months — sometimes years — and hundreds of millions of dollars coaxing a language model into something brilliant. Then we hand it to users who will abandon it if the first token doesn’t appear within a second. The model’s intelligence means nothing if the experience feels like watching a dial-up modem load a webpage.

Inference optimization is the discipline that closes this gap. It is, in many ways, more demanding than training: you cannot simply throw more compute at the problem and call it done. You have to be surgical. Every millisecond matters. Every memory byte is contested. And you have to do all of this while keeping outputs mathematically equivalent — or close enough — to the original model.

This post is a structured tour through the most effective techniques practitioners use today, organized by where in the stack they operate. Not all of them will apply to your situation, but understanding the full landscape helps you make better architectural decisions.

1. Understanding What’s Actually Slow

Before reaching for a solution, it helps to understand the anatomy of an inference call. When a user sends a prompt, two phases unfold: the prefill phase, where the model processes the entire input in parallel and produces the first token, and the decode phase, where it generates subsequent tokens one at a time.

These phases have fundamentally different bottlenecks. Prefill is compute-bound — you’re doing massive matrix multiplications over the whole context, and the GPU’s arithmetic units are the limiting factor. Decode is memory-bandwidth-bound — at each step, the model loads billions of weights from GPU memory just to produce a single token. On most hardware, memory bandwidth is the scarcer resource, which is why decoding on large models feels sluggish even on expensive GPUs.
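A quick back-of-envelope calculation makes the decode bottleneck concrete. The numbers below (model size, bandwidth) are illustrative assumptions, not measurements:

# Example - rough upper bound on single-stream decode speed (illustrative numbers)

PARAMS = 70e9            # parameters in the model
BYTES_PER_PARAM = 2      # BF16 weights
HBM_BANDWIDTH = 3.35e12  # bytes/s, roughly an H100 SXM's peak HBM bandwidth

# Each decode step must stream every weight from HBM at least once,
# so memory bandwidth caps tokens/sec regardless of arithmetic speed.
bytes_per_token = PARAMS * BYTES_PER_PARAM
max_tokens_per_sec = HBM_BANDWIDTH / bytes_per_token
print(f"~{max_tokens_per_sec:.0f} tokens/s upper bound")  # ~24 tokens/s

Batching amortizes this cost: a single weight read serves every sequence in the batch, which is why throughput-oriented serving leans so heavily on the batching strategies covered later.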

This distinction matters enormously when picking your optimization strategy. Techniques that cut arithmetic cost (such as low-precision matrix kernels) help most during prefill. Techniques that reduce memory movement (such as weight-only quantization and KV cache compression) help most during decode. Conflating the two leads to wasted effort.

“A model’s intelligence means nothing if the experience feels like watching a dial-up modem. Inference optimization is the discipline that closes this gap.”

2. Quantization: Shrinking Numbers, Not Capability

Neural network weights are stored as floating-point numbers. By default, most models trained today use BF16 (16-bit brain float), meaning each parameter consumes two bytes of memory. A 70-billion-parameter model therefore requires roughly 140 GB for its weights alone, far more than a typical 80 GB data-center GPU can hold. Quantization swaps these high-precision floats for smaller integers without retraining the model from scratch.

Post-Training Quantization (PTQ)

The most accessible form of quantization works on an already-trained model. You run a small calibration dataset through the network, observe how the activations are distributed, and then choose integer ranges that minimize the representation error. Tools like GPTQ, AWQ, and llm.int8() have made this approachable enough that practitioners routinely quantize frontier models in an afternoon.

# Example - loading a 4-bit GPTQ model with Transformers
# (quantizes on the fly using the "c4" calibration set; requires auto-gptq)

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
quantization_config = GPTQConfig(
    bits=4,                # 4-bit weights
    dataset="c4",          # calibration data used to pick quantization ranges
    tokenizer=tokenizer,   # needed to tokenize the calibration set
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=quantization_config,
    device_map="auto",     # spread layers across available devices
)

The practical tradeoff: INT8 quantization typically costs 0.5–1.5 perplexity points on standard benchmarks, while INT4 can cost 1–3 points depending on the model and task. For most production use cases — summarization, classification, instruction-following — this is imperceptible to end users. For high-stakes tasks like code generation or mathematical reasoning, you should benchmark carefully before committing.

Quantization-Aware Training (QAT)

If you control the training pipeline, QAT bakes quantization into the learning process itself. The model learns to compensate for reduced precision during training, producing weights that are intrinsically more robust to being stored as integers. The downside is obvious: it requires access to training compute. But models trained with QAT consistently outperform PTQ equivalents at the same bit-width, sometimes matching full-precision quality at INT4.
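To make the mechanism concrete, here is a minimal sketch of the fake-quantization trick at the core of most QAT recipes: weights are rounded to a low-precision grid in the forward pass, while the straight-through estimator lets gradients flow as if no rounding had happened. This is an illustrative simplification (symmetric per-tensor scaling), not any particular library's implementation:

# Example - fake quantization with a straight-through estimator (illustrative)

import torch

class FakeQuant(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w, bits=4):
        # Symmetric per-tensor quantization: round weights onto a 2^bits grid.
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max().clamp(min=1e-8) / qmax
        return (w / scale).round().clamp(-qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_out):
        # Straight-through estimator: pretend rounding was the identity.
        return grad_out, None

# During training, the layer sees quantized weights in the forward pass:
w = torch.randn(256, 256, requires_grad=True)
y = FakeQuant.apply(w, 4) @ torch.randn(256, 8)
y.sum().backward()  # gradients reach w despite the non-differentiable round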

3. KV Cache: The Hidden Memory Hog

Transformer architectures avoid recomputing attention over previous tokens by caching the Key and Value tensors from each attention layer — the so-called KV cache. This is a fundamental optimization baked into every production inference engine. Without it, generating a 1,000-token response would require reprocessing the entire prefix at each step, making decode times scale quadratically with length.

But the KV cache comes with a memory cost that grows linearly with sequence length, batch size, and number of layers. For a large model serving hundreds of concurrent users with long contexts, the cache can easily consume more GPU memory than the model weights themselves.
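The arithmetic is simple enough to sanity-check on paper: per token, every layer stores one Key and one Value vector for each KV head. The sketch below plugs in Llama-3-70B-like shapes as an assumed example:

# Example - estimating KV cache size (Llama-3-70B-like shapes, assumed)

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # 2x for Keys and Values, stored at every layer for every token.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# 80 layers, 8 KV heads (GQA), head_dim 128, 8k context, 64 concurrent users:
gb = kv_cache_bytes(80, 8, 128, 8192, 64) / 1e9
print(f"{gb:.0f} GB of cache")  # ~172 GB, more than the 70B model's BF16 weights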

The insight behind vLLM’s PagedAttention is borrowed from operating system memory management. Traditional inference engines pre-allocate a contiguous memory block for the maximum possible KV cache size per sequence, causing massive fragmentation — unused memory that can’t be reclaimed. PagedAttention stores KV cache in non-contiguous “pages,” allocating memory only as it’s actually needed, and enabling fine-grained sharing when multiple sequences attend over the same prefix (a common pattern in few-shot prompting).

In practice, PagedAttention-based engines can serve 2–4× more concurrent requests on the same hardware compared to naive implementations, which translates directly into lower cost per token.

Prefix Caching

When many requests share a common prefix — a system prompt, a fixed context document, a tool schema — recomputing the KV cache for that prefix on every request is pure waste. Prefix caching (also called prompt caching) stores the KV state for frequently seen prefixes and reuses it across requests. This can eliminate 60–80% of prefill compute for applications with stable system prompts, dramatically reducing time-to-first-token.
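In vLLM, for instance, this is a single engine flag. The snippet below uses an assumed model name, and the argument name reflects recent vLLM versions, so check your version's docs:

# Example - enabling automatic prefix caching in vLLM (flag may vary by version)

from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-7B-v0.1",
    enable_prefix_caching=True,  # reuse KV blocks for previously seen prefixes
)
params = SamplingParams(temperature=0.0, max_tokens=128)

system = "You are a helpful assistant. " * 50  # long shared prefix
outputs = llm.generate([system + "Question 1?", system + "Question 2?"], params)
# The second request's prefill skips the shared prefix: its KV blocks are cached.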

📦 PagedAttention

OS-inspired memory paging for KV cache. Eliminates fragmentation, enables prefix sharing. 2–4× throughput on concurrent workloads.

♻️ Prefix Caching

Store and reuse KV states for shared prompt prefixes. Cuts TTFT dramatically for applications with stable system prompts.

🗜️ KV Quantization

Apply INT8 or FP8 to the cache itself, not just model weights. Halves the cache memory with minimal quality impact.

✂️ Cache Eviction

StreamingLLM and H2O selectively drop cache entries for tokens with low attention scores, enabling very long contexts on limited memory.

4. Speculative Decoding: Letting a Smaller Model Do the Legwork

Speculative decoding is one of the more elegant ideas to emerge from inference research in recent years. The core intuition: token generation is sequential, but verification is not. If you have a fast small model (the draft model) and a slow large model (the target model), you can ask the draft model to generate several tokens ahead, then let the target model verify all of them in a single parallel forward pass.

When the draft is correct, which happens surprisingly often for predictable or formulaic outputs, you've produced multiple tokens at the cost of roughly one target forward pass. When it's wrong, you fall back to the target model's token at the first mismatch and continue. Crucially, the accept/reject rule is constructed (via rejection sampling against the target's probabilities) so that the final output distribution is provably identical to sampling from the target alone; the speedup costs nothing in quality.

Real-world speedups depend heavily on the acceptance rate, which varies by task. Code generation and structured outputs tend to have high acceptance rates (70–90%), yielding 3–4× speedups. Open-ended creative generation is harder to predict, with acceptance rates closer to 50–60%. Some implementations use the model itself as the draft via layer-skipping or early exit, eliminating the need for a separate draft model.
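The greedy-decoding variant fits in a few lines. The sketch below assumes target and draft are Hugging-Face-style causal LMs returning .logits, and omits KV caching for clarity (a real implementation would cache both models' states):

# Example - one greedy speculative decoding step (illustrative, no KV caching)

import torch

@torch.no_grad()
def speculative_step(target, draft, ids, k=4):
    # 1) Draft proposes k tokens, one cheap forward pass at a time.
    proposal = ids
    for _ in range(k):
        logits = draft(proposal).logits[:, -1]
        proposal = torch.cat([proposal, logits.argmax(-1, keepdim=True)], dim=-1)
    # 2) Target verifies all k proposals in one parallel forward pass.
    tgt = target(proposal).logits[:, -k - 1:].argmax(-1)  # target's pick per position
    drafted = proposal[:, -k:]
    # 3) Accept the longest prefix where draft and target agree, then append the
    #    target's own token at the first disagreement (or a bonus token if all
    #    matched) - so every step yields at least one token.
    n = int((tgt[:, :k] == drafted)[0].long().cumprod(0).sum())
    return torch.cat([ids, drafted[:, :n], tgt[:, n:n + 1]], dim=-1)

Sampling-based decoding additionally needs the rejection step from the original papers to preserve the target distribution exactly.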

5. Batching Strategies: Keeping the GPU Fed

GPUs are at their most efficient when processing many requests in parallel — a single request to a large model leaves most arithmetic units idle. The challenge is that language model requests arrive unpredictably and have wildly varying lengths, making naive static batching inefficient.

Continuous Batching

Traditional static batching waits for a full batch to accumulate before processing, and every sequence in that batch must finish before any new one can begin, so a single long generation holds the whole batch hostage. With continuous batching (also called in-flight batching), requests join and leave the batch dynamically: as one sequence finishes generating, a new one immediately takes its slot. This is the single highest-leverage batching improvement available and is now standard in production serving frameworks like vLLM and TGI.

Chunked Prefill

Long prompts during prefill can monopolize the GPU and stall decode for other ongoing requests — a phenomenon called prefill stalls. Chunked prefill splits large prefills into smaller chunks, interleaving them with decode steps from other sequences. This reduces tail latency significantly for mixed workloads where some users send long system prompts while others are mid-conversation.
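In vLLM, for example, chunked prefill is a configuration toggle rather than a code change. The argument names below come from recent versions and may differ in yours:

# Example - serving with chunked prefill in vLLM (argument names may vary)

from vllm import LLM

llm = LLM(
    model="mistralai/Mistral-7B-v0.1",
    enable_chunked_prefill=True,   # interleave prefill chunks with decode steps
    max_num_batched_tokens=2048,   # cap tokens per scheduler step (chunk budget)
)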

6. Model Architecture Optimizations

Some of the most durable inference gains come from architectural choices made at design time, not post-hoc optimizations. Understanding these helps you evaluate model families and make informed deployment decisions.

Multi-Query and Grouped-Query Attention

Standard multi-head attention (MHA) uses separate Key and Value projections for each attention head. Multi-query attention (MQA) collapses all heads to share a single K/V pair, dramatically shrinking the KV cache at the cost of some modeling capacity. Grouped-query attention (GQA) — used in Llama-3, Mistral, and most modern architectures — finds the middle ground: heads are grouped, with each group sharing K/V pairs. GQA typically recovers most of MHA’s quality while achieving KV cache reductions close to MQA.
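Mechanically, GQA is a small change: the cache stores only the KV heads, and each one is broadcast to its group of query heads at attention time. A minimal sketch of that expansion, assuming tensors shaped (batch, heads, seq, head_dim):

# Example - expanding shared KV heads to query heads in GQA (illustrative)

import torch

def expand_kv(kv, n_heads, n_kv_heads):
    # The cache holds n_kv_heads heads; each serves n_heads // n_kv_heads query heads.
    return kv.repeat_interleave(n_heads // n_kv_heads, dim=1)

k = torch.randn(1, 8, 1024, 128)                  # 8 KV heads cached instead of 64
k_full = expand_kv(k, n_heads=64, n_kv_heads=8)   # (1, 64, 1024, 128) at compute time
# The KV cache shrinks by n_heads / n_kv_heads = 8x relative to MHA.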

Flash Attention

Standard attention implementations materialize the full N×N attention score matrix in GPU high-bandwidth memory (HBM), and the traffic to and from that matrix becomes the bottleneck for long sequences. FlashAttention tiles the computation into blocks that fit in fast on-chip SRAM, avoiding most reads and writes to HBM. The result is exact attention (not approximate) that is 2–4× faster and uses O(N) rather than O(N²) memory for the attention computation. FlashAttention-3 further exploits hardware features of H100-class GPUs to push utilization above 75% on long-context workloads.
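In PyTorch, FlashAttention kernels are usually reached through scaled_dot_product_attention rather than called directly. The backend-selection context manager below exists in recent releases (2.3+), so treat the exact import path as version-dependent:

# Example - requesting the FlashAttention backend via PyTorch SDPA (PyTorch 2.3+)

import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q, k, v = [torch.randn(1, 32, 4096, 128, device="cuda", dtype=torch.bfloat16)
           for _ in range(3)]

with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    # Exact attention, computed in tiles without materializing the 4096x4096 matrix.
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)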

Mixture of Experts (MoE)

Mixture of Experts architectures replace dense feed-forward layers with a collection of specialized sub-networks (“experts”), routing each token to only a subset. A 70B-parameter MoE model might activate only 12–14B parameters per token. This means you can serve a model with the quality of a large dense model at the compute cost of a much smaller one — provided you have enough memory to hold all experts simultaneously. Memory capacity, not compute, becomes the binding constraint.
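The routing logic itself is compact. The sketch below shows top-2 gating over a list of expert MLPs; it is a didactic loop, whereas production kernels batch tokens per expert for efficiency:

# Example - top-2 expert routing in an MoE layer (didactic, not optimized)

import torch
import torch.nn as nn

def moe_forward(x, experts, router, k=2):
    # x: (tokens, d_model); router: Linear(d_model, n_experts)
    weights, idx = router(x).softmax(-1).topk(k, dim=-1)  # top-k experts per token
    out = torch.zeros_like(x)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = idx[:, slot] == e
            if mask.any():  # only the routed tokens touch this expert's weights
                out[mask] += weights[mask, slot, None] * expert(x[mask])
    return out

experts = nn.ModuleList([nn.Linear(512, 512) for _ in range(8)])
router = nn.Linear(512, 8)
y = moe_forward(torch.randn(16, 512), experts, router)  # each token runs 2 of 8 experts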

“The best inference optimization is often not a runtime trick but a smarter architecture decision made six months earlier during model design.”

7. Kernel Fusion and Hardware-Specific Tuning

Modern deep learning frameworks like PyTorch execute operations as sequences of discrete GPU kernels. Each kernel launch has overhead, and intermediate results must be written to and read from memory between steps. Kernel fusion combines adjacent operations into a single kernel, eliminating this overhead.

torch.compile in PyTorch 2.x automates much of this, using a JIT compiler to trace and optimize computation graphs at runtime. For models that benefit from graph-level optimizations, torch.compile can provide 15–40% speedups with minimal code changes. More aggressive options include compiling to TensorRT (NVIDIA) or Core ML (Apple Silicon), which apply hardware-specific transformations like weight compression, operator fusion, and memory layout optimization.
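Getting started is often a one-line change, shown here on a toy MLP; real-world speedups depend heavily on the model and input shapes, so treat this as a starting point rather than a guarantee:

# Example - fusing and optimizing a model with torch.compile (PyTorch 2.x)

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
model = model.cuda().eval()
compiled = torch.compile(model, mode="reduce-overhead")  # CUDA graphs + kernel fusion

x = torch.randn(8, 512, device="cuda")
with torch.inference_mode():
    out = compiled(x)  # first call triggers compilation; later calls run the fused graph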

Tensor Parallelism for Multi-GPU Serving

When a model doesn’t fit on a single GPU, the simplest solution is pipeline parallelism: different layers live on different GPUs, and activations pass sequentially between them. A more communication-efficient alternative is tensor parallelism, which splits individual weight matrices across GPUs so each device performs a slice of every matrix multiplication simultaneously. Megatron-LM-style tensor parallelism scales well to 8 GPUs within a node (where NVLink provides fast interconnect) and is the standard approach for serving models in the 70B+ range.

8. Putting It Together: A Decision Framework

With this many options available, the hard part is knowing which levers to pull first. Here’s a mental model that has served well across different deployment contexts:

  1. Profile first. Before optimizing anything, measure where time actually goes. Is your bottleneck TTFT (prefill), throughput (decode), or memory capacity? Tools like nsys, ncu, and serving framework dashboards will tell you.
  2. Enable continuous batching. If you’re not doing this, it’s the highest-leverage change with the lowest quality risk. Start here.
  3. Quantize to INT8 or INT4. If memory is your constraint and you’re not on a tiny model where quality margins are thin, quantize. Validate on your specific task distribution.
  4. Add prefix caching. If your application uses a fixed system prompt (most do), prefix caching is free performance. Enable it in your serving layer.
  5. Consider speculative decoding. If your use case generates predictable outputs (code, structured data, templated text), speculative decoding can deliver substantial throughput gains.
  6. Evaluate architecture choices for new deployments. If you’re selecting or training a new model, prefer GQA-based architectures and verify FlashAttention compatibility upfront.

Not all optimizations compose cleanly. Quantization interacts with speculative decoding in non-obvious ways. Tensor parallelism changes the communication pattern for batching. Test your chosen combination end-to-end before committing to it in production.

Where Things Are Heading

The inference optimization landscape moves fast. A few directions worth watching: FP8 training and serving is becoming mainstream on H100s and AMD MI300X, offering a sweet spot between FP16 quality and INT8 speed. Continuous speculative decoding (where the draft model runs asynchronously on a separate stream) is reducing the latency overhead of verification. And disaggregated prefill/decode — routing prefill and decode to separate, specialized hardware pools — is starting to appear in large-scale serving infrastructure at major labs.

The underlying pressure will only intensify. Context windows are growing. Agentic systems run multiple inference calls per user interaction. And the frontier of what users expect from response latency keeps moving downward. Inference optimization is no longer an afterthought — it’s a core engineering discipline, as critical to the user experience as model quality itself.