How to Reduce LLM Latency by 40%
Practical techniques to cut LLM response times — from prompt compression and caching to streaming and model selection — without sacrificing output quality.
Supporting Guide for: AI Cost Reduction & LLM Optimisation
Every millisecond your LLM takes to respond is a millisecond your users spend staring at a spinner. In production systems, latency compounds. A chatbot that takes 4 seconds per response feels sluggish. An AI agent that chains 6 calls together at 3 seconds each takes 18 seconds to complete a single task. Users notice. Conversion drops. Support tickets rise.
The good news is that most production LLM systems are leaving enormous performance on the table. Through a combination of prompt engineering, caching, architecture changes, and model selection, we routinely help clients cut their P95 latency by 40% or more — often without any measurable drop in output quality. Here is how.
Prompt Compression — Less In, Faster Out
The single biggest lever you have over latency is input token count. Prefill time — and therefore time-to-first-token — scales roughly linearly with input length at typical prompt sizes, so a prompt that is half the size will process in roughly half the time.
Remove redundant context. Most production prompts accumulate cruft over time. System prompts grow as teams paste in edge-case instructions. Few-shot examples multiply. The first step is auditing your prompts and removing anything that does not measurably improve output quality. We have seen system prompts shrink from 2,000 tokens to 400 after a focused review, with no quality regression on the test suite.
Use structured compression. Techniques like LLMLingua and similar prompt compression libraries can reduce prompt length by 30–50% while preserving semantic meaning. These tools identify and remove tokens that contribute the least to the model's understanding. For retrieval-augmented generation (RAG) systems, compressing the retrieved context chunks before injection can dramatically reduce input size.
Shorten few-shot examples. If you are using few-shot prompting, each example adds significant token overhead. Consider whether a single well-chosen example achieves the same quality as five mediocre ones. In many cases, it does.
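A prompt audit starts with knowing where the tokens actually go. The sketch below breaks a prompt into its components and ranks them by estimated size; it uses a crude whitespace heuristic as a stand-in, where a real audit would use the model's own tokenizer (e.g. tiktoken for OpenAI models). The component names and example strings are illustrative.

```python
# Minimal prompt audit: estimate the token cost of each prompt component
# so the biggest contributors can be trimmed first. Whitespace splitting
# is a rough stand-in for the model's real tokenizer.

def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly 1.3 tokens per whitespace-separated word.
    return int(len(text.split()) * 1.3)

def audit_prompt(components: dict[str, str]) -> list[tuple[str, int]]:
    """Return prompt components sorted by estimated token count, largest first."""
    sizes = [(name, estimate_tokens(text)) for name, text in components.items()]
    return sorted(sizes, key=lambda pair: pair[1], reverse=True)

prompt = {
    "system": "You are a helpful support agent. " * 40,
    "few_shot_examples": "Q: example question A: example answer " * 25,
    "user_query": "How do I reset my password?",
}

for name, tokens in audit_prompt(prompt):
    print(f"{name}: ~{tokens} tokens")
```

Running this on your real prompts usually reveals that one or two components dominate, which tells you where compression effort pays off first.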
KV Caching — Stop Recomputing What You Already Know
Every time you send a request to an LLM, the model processes the entire input from scratch. If your system prompt is 500 tokens long and identical across every request, you are paying the computational cost of processing those 500 tokens on every single call.
Prompt caching (supported natively by Anthropic and available through various providers) stores the key-value computations for static prompt prefixes. Subsequent requests that share the same prefix skip the computation entirely. For systems with long, stable system prompts, this alone can cut time-to-first-token by 50–80%.
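To illustrate, here is the rough shape of a request that opts a static system prompt into Anthropic-style prompt caching by marking it with a cache_control block. The field names follow Anthropic's Messages API at the time of writing, and the model id and prompt text are placeholders; check the provider's current documentation before relying on the exact shape.

```python
# Illustrative request body for prompt caching: the long, stable system
# prompt is marked cacheable so its key-value computations can be reused
# by later requests that share the same prefix.

STATIC_SYSTEM_PROMPT = "You are a support assistant. (long, stable instructions...)"

request = {
    "model": "claude-sonnet-example",  # placeholder model id
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": STATIC_SYSTEM_PROMPT,
            # Mark the stable prefix as cacheable; subsequent requests
            # sharing this exact prefix skip recomputing it.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [
        {"role": "user", "content": "How do I reset my password?"},
    ],
}
```

Note that caching keys on the exact prefix, so keep the static portion byte-identical across requests and put anything that varies (user data, retrieved context) after it.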
Application-level response caching is equally important. If your LLM is answering the same question repeatedly — and in most production systems, a surprisingly large fraction of queries are near-duplicates — you should cache the full response. A Redis layer with semantic similarity matching (using embeddings to identify "close enough" queries) can serve cached responses in under 10ms instead of the 2–4 seconds a fresh LLM call would take.
Speculative Decoding — The Speed Trick That Actually Works
Speculative decoding is one of the most underused latency optimisations available today. The concept is simple: a small, fast "draft" model generates candidate tokens, and the large "target" model verifies them in parallel. Since verification is cheaper than generation, and the draft model gets most tokens right, you get the quality of the large model at closer to the speed of the small one.
In practice, speculative decoding can deliver 2–3x throughput improvement on self-hosted models. It works best when the draft model is well-matched to your use case (fine-tuned on similar data) and when the output is somewhat predictable (structured JSON, templated responses, code generation). It works less well for highly creative or unpredictable outputs.
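The draft-and-verify control loop can be made concrete with deterministic stubs in place of real models. The "models" below just index into fixed token sequences so the accept/reject mechanics are visible; a real implementation verifies all k draft tokens in a single batched forward pass of the target model rather than one call per token.

```python
# Toy sketch of the speculative-decoding loop: a fast draft model proposes
# up to k tokens, the target model checks them, and generation falls back
# to the target model's own token at the first disagreement.

def draft_model(prefix: list[str], k: int) -> list[str]:
    # Stub draft model: mostly agrees with the target, but drafts "leaps"
    # where the target would say "jumps".
    continuation = ["the", "quick", "brown", "fox", "leaps", "over"]
    return continuation[len(prefix):len(prefix) + k]

def target_model(prefix: list[str]) -> str:
    continuation = ["the", "quick", "brown", "fox", "jumps", "over"]
    return continuation[len(prefix)]

def speculative_step(prefix: list[str], k: int = 4) -> list[str]:
    drafts = draft_model(prefix, k)
    accepted: list[str] = []
    for token in drafts:
        if target_model(prefix + accepted) == token:
            accepted.append(token)  # draft verified: accepted almost for free
        else:
            # First mismatch: take the target model's token and stop.
            accepted.append(target_model(prefix + accepted))
            break
    return accepted

out: list[str] = []
while len(out) < 6:
    out.extend(speculative_step(out))
print(out)
```

Because the output is still checked token-by-token against the target model, the final sequence is exactly what the target model alone would have produced; the draft model only changes how fast you get there.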
Streaming Responses — Perceived vs Actual Latency
Sometimes the fastest path is not making the model faster but making the user feel like it is faster. Streaming sends tokens to the client as they are generated rather than waiting for the full response. The time-to-first-token becomes the perceived latency, which is typically 200–500ms instead of the full 2–4 second generation time.
For chat interfaces, streaming is essentially mandatory. For API-to-API calls within a pipeline, streaming is less relevant — but you can still benefit from it by beginning downstream processing on partial outputs before the full response arrives.
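The gap between perceived and actual latency falls out of a simple generator sketch: the first token arrives after one token's worth of decode time, while the full response takes the sum. The per-token delay below is a fake stand-in for model decode time.

```python
# Why streaming changes perceived latency: time-to-first-token is one
# decode step, while the full response takes all of them.
import time

def generate_stream(tokens: list[str], per_token_s: float = 0.01):
    for token in tokens:
        time.sleep(per_token_s)  # stand-in for per-token decode time
        yield token

tokens = "Sure - here is how to reset your password".split()
start = time.monotonic()
stream = generate_stream(tokens)

first = next(stream)                 # the user starts reading here
ttft = time.monotonic() - start

rest = list(stream)                  # remaining tokens keep arriving
total = time.monotonic() - start

print(f"time to first token: {ttft:.3f}s, full response: {total:.3f}s")
```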
Batching Strategies — Throughput vs Latency
If you are self-hosting models, continuous batching (also called iteration-level batching) is a critical optimisation. Unlike naive batching, which waits for an entire batch to complete before returning any results, continuous batching allows requests to enter and exit the processing pipeline independently. This keeps GPU utilisation high without penalising individual request latency.
Frameworks like vLLM, TensorRT-LLM, and SGLang implement continuous batching out of the box. If you are running a self-hosted inference server without continuous batching, you are likely leaving 3–5x throughput on the table.
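The scheduling behaviour that makes continuous batching effective can be simulated without any GPU at all. In the sketch below, each iteration of the loop is one decode step: newly arrived requests join the running batch as soon as a slot is free, and finished requests leave immediately instead of waiting for the batch to drain. This only models the scheduler; frameworks like vLLM implement the same idea over real GPU kernels.

```python
# Simplified simulation of continuous (iteration-level) batching.
from collections import deque

def continuous_batching(requests, max_batch=4):
    """requests: list of (arrival_step, tokens_needed). Returns finish step per request id."""
    waiting = deque(sorted(enumerate(requests), key=lambda r: r[1][0]))
    running = {}   # request id -> tokens still to generate
    finished = {}  # request id -> decode step at which it completed
    step = 0
    while waiting or running:
        # Admit arrivals the moment a batch slot is free (iteration-level,
        # not once per full batch).
        while waiting and len(running) < max_batch and waiting[0][1][0] <= step:
            rid, (_, tokens) = waiting.popleft()
            running[rid] = tokens
        # One decode step advances every running request by one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]        # leaves the batch immediately
                finished[rid] = step
        step += 1
    return finished

# A short request arriving mid-flight finishes without waiting for the
# long request that is already running.
print(continuous_batching([(0, 10), (2, 3)]))
```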
Model Selection — Right-Size Your Intelligence
Not every request needs GPT-4-class intelligence. A classification task, a simple extraction, or a formatting conversion can often be handled by a model that is 10x faster and 50x cheaper.
The routing pattern uses a lightweight classifier (or even a small LLM) to examine incoming requests and route them to the appropriate model. Simple queries go to a fast, cheap model. Complex queries go to the flagship model. In practice, 60–80% of production traffic can typically be handled by smaller models, and overall system latency drops dramatically.
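A minimal version of the routing pattern is just a cheap classifier in front of two model endpoints. The keyword heuristic, length cutoff, and model names below are illustrative assumptions; production routers typically use a small classifier model or embeddings trained on labelled traffic rather than hand-written rules.

```python
# Sketch of a model router: cheap heuristics decide whether a request
# needs the flagship model or can go to the fast, inexpensive one.

FAST_MODEL = "fast-small-model"       # e.g. a Haiku-class model
FLAGSHIP_MODEL = "flagship-model"     # e.g. an Opus-class model

# Illustrative signals that a query needs deeper reasoning.
COMPLEX_SIGNALS = ("explain why", "compare", "step by step", "debug", "analyze")

def route(query: str, max_simple_len: int = 120) -> str:
    q = query.lower()
    if len(query) > max_simple_len or any(s in q for s in COMPLEX_SIGNALS):
        return FLAGSHIP_MODEL
    return FAST_MODEL

print(route("What are your opening hours?"))                  # simple -> fast model
print(route("Compare these two architectures step by step"))  # complex -> flagship
```

The router itself must be much cheaper than the savings it unlocks, which is why heuristics, tiny classifiers, or cached routing decisions are preferred over calling another large model to decide.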
Before and after metrics from a recent client engagement: a customer support chatbot was running all queries through Claude Opus at an average latency of 3.2 seconds. After implementing a router that sent straightforward FAQ-style questions to Claude Haiku (average latency 0.4 seconds) and only escalated genuinely complex queries to Opus, the overall P50 latency dropped from 3.2 seconds to 0.7 seconds — a 78% improvement. The P95 dropped from 6.1 seconds to 3.4 seconds, as complex queries still required the larger model but were no longer waiting behind simple ones.
Infrastructure Optimisations — The Last 20%
For self-hosted deployments, hardware and configuration choices matter enormously.
Quantisation reduces model precision from FP16 to INT8 or INT4, cutting memory requirements and increasing inference speed by 30–70% with minimal quality loss on most tasks. AWQ and GPTQ are the most widely used quantisation methods, and both are well-supported by modern inference frameworks.
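The core arithmetic of weight quantisation fits in a few lines. The sketch below does naive symmetric INT8 quantisation with a single per-tensor scale; real methods like AWQ and GPTQ are considerably more sophisticated (per-group scales, activation-aware calibration), but the memory win is the same: one byte per weight instead of two.

```python
# Minimal symmetric INT8 quantisation: map floats to 8-bit integers with
# one scale per tensor, then dequantise at use time and measure the error.

def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    scale = max(abs(w) for w in weights) / 127.0  # largest weight maps to +/-127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]

weights = [0.42, -1.30, 0.07, 0.91, -0.55]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max quantisation error: {max_err:.4f} (scale {scale:.5f})")
```

The worst-case rounding error is half the scale, which is why quality degrades gracefully for well-behaved weight distributions and why outlier weights (which inflate the scale) are the main thing the fancier methods work around.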
GPU memory bandwidth is often the true bottleneck for LLM inference, not raw compute. The NVIDIA H100 offers 3.35 TB/s of memory bandwidth compared to the A100's 2.0 TB/s. For inference-heavy workloads, the H100's bandwidth advantage matters more than its compute advantage.
Tensor parallelism across multiple GPUs reduces per-request latency by splitting the model across devices. This is distinct from data parallelism (which improves throughput) and is essential for serving models that do not fit in a single GPU's memory.
Putting It All Together
A 40% latency reduction is conservative. Most production systems we audit have so much low-hanging fruit that the first round of optimisations delivers 50–70% improvement. The key is measuring properly — track P50, P95, and P99 latency separately, measure time-to-first-token distinctly from total generation time, and always A/B test changes against a quality benchmark.
Start with prompt compression and caching because they require no infrastructure changes. Move to model routing next because it delivers the largest single improvement. Then tackle infrastructure optimisations if you are self-hosting, or evaluate whether self-hosting makes sense if you are currently on API-only.
The compound effect of these techniques is what gets you from "our chatbot feels slow" to "our chatbot feels instant." And in production AI, that difference is the difference between a product people tolerate and a product people love.
Ready to implement this?
We help founders master vibe coding at scale. Book a Free Technical Triage to unblock your build.