
Chapter 63: Inference Economics

Cast of characters
| Name | Lifespan | Role |
| --- | --- | --- |
| Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, Byung-Gon Chun | | Orca authors (Seoul National University / FriendliAI); iteration-level scheduling and selective batching for generative Transformer serving (OSDI 2022) |
| Tri Dao | | Lead author of FlashAttention (Stanford); IO-aware exact attention, HBM/SRAM traffic as the bottleneck (May 2022) |
| Guangxuan Xiao | | Lead author of SmoothQuant (MIT); post-training W8A8 quantization that migrates activation outliers to weights (Nov 2022) |
| Ying Sheng | | Lead author of FlexGen (Stanford); GPU/CPU/disk offload for high-throughput inference under limited GPU memory (Mar 2023) |
| Woosuk Kwon | | Lead author of vLLM/PagedAttention (Berkeley); KV-cache management for high-throughput serving (SOSP 2023) |
| Yaniv Leviathan, Matan Kalman, Yossi Matias; Charlie Chen et al. | | Speculative decoding/sampling authors (Google; DeepMind); faster generation through proposal-and-verification methods (2022/2023) |
Timeline (May 2022 – 2024)
May 2022 : FlashAttention reframes attention speed around HBM/SRAM IO, not just FLOPs
Jul 2022 : Orca (OSDI) proposes iteration-level scheduling and selective batching for generative Transformer serving
Nov 2022 : SmoothQuant shows a post-training W8A8 path for LLM inference
Nov 2022 / Feb 2023 : Speculative decoding and speculative sampling papers target the serial decoding loop
Mar 2023 : FlexGen demonstrates GPU/CPU/disk offload for throughput-oriented generation on a single T4
Sep / Oct 2023 : vLLM / PagedAttention makes KV-cache management a first-class serving primitive
2024 : DistServe (OSDI) separates serving phases to improve SLO-compliant capacity
Plain-words glossary

Autoregressive generation — A language model produces output one token at a time, feeding each new token back as input for the next step. A long answer is therefore a chain of serial model passes, not a single forward computation. This is why generation latency compounds and why scheduling has to look at iterations rather than whole requests.

KV cache (key/value cache) — The per-request store of attention keys and values from prior tokens, kept so the model does not recompute them at every decoding step. It grows with prompt length, output length, and concurrent users. Because model weights are static but the KV cache is dynamic, KV cache memory — not parameter count — often determines how many requests fit on an accelerator.

Iteration-level scheduling — Orca’s scheduling unit. Instead of treating each user request as a single batch member from start to finish, the serving system decides at every model iteration which active requests continue, which finished requests leave, and which newly arrived requests join. Required for generative workloads where requests have wildly different lengths.

PagedAttention — vLLM’s KV-cache memory manager for reducing waste and improving serving concurrency.

TTFT and TPOT (time to first token, time per output token) — two latency measures that separate when an answer starts from how smoothly it streams.

Goodput — useful work completed within a service-level objective, not merely work completed eventually.

Speculative decoding / sampling — a serving technique that uses cheaper proposals to reduce the number of expensive decoding passes.

Training made frontier AI look expensive. Inference made it operationally expensive.

The distinction became unavoidable once large models moved from research demos into products. A lab might train a model once, or a few times, at enormous cost. But a product has to run the model every time a user asks a question, uploads a document, requests a summary, calls a tool, or speaks into a microphone. The meter starts again with every prompt. Each output token consumes compute, memory bandwidth, scheduling attention, and latency budget.

This was the quiet shift underneath the product shock. The public saw chat interfaces, copilots, multimodal assistants, and agents. Operators saw queues of requests that had to be packed onto scarce accelerators without making users wait. A model that is impressive in a benchmark is not automatically a product. It has to answer fast enough, cheaply enough, and reliably enough under load.

Autoregressive generation is the first reason. A conventional classifier can often process an input once and produce a result. A large language model generating text loops. It reads the prompt, produces a token, then feeds the growing sequence back through the model to produce the next token. A long answer is not one model run. It is a chain of repeated decoding steps.

That loop turns latency into a compound problem. There is time to first token, the delay before a user sees the answer begin. There is time per output token, the rhythm at which the answer continues. There is throughput, the number of requests or tokens a serving system can handle. There is utilization, the fraction of expensive accelerator capacity doing useful work. There is goodput, the useful work completed while meeting the service-level objective rather than merely producing tokens too late to matter.

These quantities pull against one another. A serving system can often improve throughput by batching requests together, but batching can make individual users wait. It can reduce latency by serving smaller batches, but that may strand accelerator capacity. It can accept longer contexts, but those contexts consume memory that might otherwise serve more users. Product AI is therefore not only model science. It is queueing, memory management, and systems engineering.

That is why inference economics is different from a simple price tag. A token is not a uniform object in the machinery. Prompt tokens and output tokens stress the system differently. A short prompt with a long answer behaves differently from a long document with a short answer. A single-user demo can look smooth while a production service faces bursts, cancellations, retries, tool calls, and users whose outputs finish at different times. The product problem is to make all of that fit into a predictable service.

Orca made the scheduling problem concrete. Yu and collaborators described serving for Transformer-based generative models as different from ordinary request processing because generation requires many model iterations. Existing request-level batching waited for a batch of requests to finish together. That is a poor match for variable-length generation. One user may ask for a sentence. Another may ask for a long essay. If the batch is treated as a single unit, finished requests can wait, and new arrivals can be blocked from joining useful work already in progress.

The old batching picture is easy to imagine. Gather several requests. Run them together. Return results when the batch completes. That works best when each request has roughly similar work. Autoregressive generation breaks that assumption. Requests arrive at different times, have different prompt lengths, and produce different output lengths. Some finish early. Some keep generating. The serving system has to decide what to do every iteration, not merely every request.

Orca’s answer was iteration-level scheduling. Instead of treating the whole request as the scheduling unit, it looked at the repeated model iterations. At each iteration, the system could decide which active requests should continue, which finished requests should leave, and which new requests could enter. Selective batching let different operations be batched when they could be batched and separated when their shapes or requirements differed.

Selective batching matters because generation is not one perfectly uniform operation repeated forever. Different requests can have different attention shapes, different prompt lengths, and different points in the generation loop. A serving system that batches everything too coarsely wastes time waiting for mismatched work. A serving system that batches nothing leaves the accelerator underfilled. Orca’s contribution was to show that the right unit of scheduling had moved inward, from the user request to the generation iteration.
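The difference between the two scheduling units can be made concrete with a toy count of model iterations. The sketch below is an illustration only, not Orca's scheduler: it assumes every request is already queued, ignores prefill and selective batching, and uses hypothetical output lengths and a hypothetical batch capacity (`slots`).

```python
from collections import deque


def request_level_iterations(lengths, slots):
    """Static batching: each batch holds the accelerator until its longest request ends."""
    total = 0
    for start in range(0, len(lengths), slots):
        total += max(lengths[start:start + slots])
    return total


def iteration_level_iterations(lengths, slots):
    """Re-form the batch at every iteration: finished requests leave, queued ones join."""
    queue = deque(lengths)
    active = []  # remaining decode iterations for each running request
    total = 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.popleft())
        total += 1  # one model iteration for everything currently in the batch
        active = [left - 1 for left in active if left > 1]
    return total


lengths = [2, 8, 2, 8]  # hypothetical output lengths (each at least 1 iteration)
print(request_level_iterations(lengths, slots=2))    # 16
print(iteration_level_iterations(lengths, slots=2))  # 12
```

In the static case the short requests finish early but their slots stay idle until the long request in the same batch completes. Refilling slots every iteration recovers part of that idle capacity, which is the intuition behind moving the scheduling unit inward.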

The result was not just an optimization detail. It was a change in what “serving a model” meant. A generative model server became a live scheduler for many partially complete conversations. The system had to keep accelerator work dense while respecting latency. Orca reported a 36.9x throughput improvement at the same latency against FasterTransformer for GPT-3 175B in its evaluation. That number should be read as a paper result, not as a universal production guarantee. Its historical value is that it named the central problem: generation workloads need scheduling built around the token loop.

Scheduling was only one side of the bill. Memory was the other.

FlashAttention showed why attention performance was not only about arithmetic. Transformers had become associated with matrix multiplication and FLOPs, but Dao and collaborators emphasized IO-awareness: the cost of reading and writing data between high-bandwidth memory and faster on-chip SRAM. Standard attention materializes a large attention matrix. For long sequences, that matrix is expensive to move through memory. FlashAttention avoided reading and writing the full sequence-by-sequence attention matrix to high-bandwidth memory and reported large wall-clock speedups for attention computation.

The lesson was broader than one kernel. GPUs are fast at math, but only when the right data is in the right place at the right time. If the computation constantly waits on memory movement, the theoretical FLOP rate is a mirage. A serving system can have powerful accelerators and still lose money or latency to memory traffic. Long contexts make this sharper because the score matrix in standard attention grows quadratically with sequence length, and the data structures around the request grow with the conversation.

This is where the memory hierarchy enters the history of AI products. High-bandwidth GPU memory is precious. On-chip SRAM is faster but much smaller. CPU memory is larger but slower. Disk is larger still but much slower. The serving engineer’s job is partly to decide what must live close to the compute, what can be moved, what can be compressed, and what can be avoided entirely. The model’s public behavior depends on these hidden decisions.

The hierarchy is easy to underestimate because software often hides it. A user asks a question and sees words appear. Underneath, tensors are being moved through layers of storage with very different capacities and speeds. If a kernel saves arithmetic but increases traffic to high-bandwidth memory, it may not help. If it avoids unnecessary reads and writes, it can make an exact computation faster without changing the model’s mathematical result. That was the appeal of IO-aware attention: it turned memory movement into an explicit design target.
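One way to see how an exact result can survive without ever holding the full score matrix is the running-softmax trick that tiled attention kernels rely on. The sketch below is plain Python for a single query vector, not the FlashAttention GPU kernel: the function names and the `block` parameter are illustrative, and the usual 1/sqrt(d) scaling is omitted to keep the arithmetic visible.

```python
import math


def attention_reference(q, keys, values):
    """Standard form: compute every score first, then normalize."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def attention_blockwise(q, keys, values, block=2):
    """Visit keys/values one block at a time, keeping only running statistics."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim  # unnormalized weighted sum of values seen so far
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        scale = math.exp(running_max - new_max)  # rescale earlier partial results
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.5, -0.3], [0.0, 2.0], [-1.0, 0.4], [0.7, 0.7]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
print(attention_reference(q, keys, values))
print(attention_blockwise(q, keys, values, block=2))
```

Both functions produce the same output up to floating-point rounding. The blockwise version only ever needs one block of scores at a time, which is the property that lets a real kernel keep its working set in fast on-chip memory.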

FlashAttention belongs in this chapter because it changed the mental model. Faster inference was not simply “use a bigger GPU” or “reduce FLOPs.” It was also “reduce unnecessary memory movement.” Exact attention could become faster by respecting the hardware memory hierarchy. That idea would echo through serving systems where the limiting object was often not the static model weights, but the dynamic state attached to each request.

That dynamic state is the KV cache.

When a Transformer generates tokens, it does not want to recompute every key and value vector from scratch at every step. It stores prior keys and values so later decoding can attend to them. This cache is invisible to the user. The user sees a conversation history. The serving system sees memory that grows as the prompt and output grow. The longer the context and the more concurrent users, the more cache memory is required.
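A back-of-the-envelope calculation shows how quickly this state adds up. The sketch below assumes standard multi-head attention (one key vector and one value vector of the full hidden size per layer, with no grouped or shared heads) and 16-bit storage. The function name is illustrative, and the dimensions are assumptions chosen to resemble a 13B-class model with 40 layers and a hidden size of 5120.

```python
def kv_cache_bytes_per_token(num_layers, hidden_size, bytes_per_value=2):
    """Bytes of KV cache added by one token of context."""
    # The factor 2 covers one key vector plus one value vector per layer.
    return 2 * num_layers * hidden_size * bytes_per_value


per_token = kv_cache_bytes_per_token(num_layers=40, hidden_size=5120)
print(per_token)                   # 819200 bytes, i.e. 800 KiB per token
print(per_token * 2048 / 1024**3)  # 1.5625 GiB for one 2048-token sequence
```

Under these assumptions a single long request holds more than a gibibyte of accelerator memory before any other user is served.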

The vLLM/PagedAttention paper made the KV cache the central object of serving economics. Kwon and collaborators described KV cache memory as huge, dynamic, and prone to waste through fragmentation and duplication. Model weights are mostly static once loaded. KV cache is different: it is per request, changes over time, and competes directly with batch size. If cache memory is wasted, fewer requests can fit on the accelerator. If fewer requests fit, throughput falls or latency rises.

Fragmentation is the plain-language problem. A serving system may reserve more memory for a request than it eventually needs, or allocate chunks that are hard to reuse efficiently as sequences grow and finish. The result is not just untidy memory. It is lost capacity. Expensive GPU memory can sit unusable because it is trapped in the wrong shape.

The waste is especially painful because the cache competes with batching. More cache memory per request means fewer simultaneous requests can fit. Fewer simultaneous requests mean weaker batching opportunities. Weaker batching means lower utilization or higher per-request cost. A user asking for a longer answer may therefore affect not only that user’s latency, but the serving system’s ability to keep many other requests packed efficiently.
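The competition between cache and batching reduces to a division. The sketch below is a deliberately crude capacity estimate: it ignores activations, allocator overhead, and fragmentation, and every number in it (a 40 GiB accelerator, 24 GiB of weights, 800 KiB of cache per token) is a hypothetical input, not a measurement.

```python
GIB = 1024 ** 3


def max_concurrent_requests(gpu_bytes, weight_bytes, kv_bytes_per_token,
                            tokens_per_request):
    """Upper bound on requests whose KV cache fits beside the weights."""
    free = gpu_bytes - weight_bytes
    if free <= 0:
        return 0
    return free // (kv_bytes_per_token * tokens_per_request)


for tokens in (1024, 2048, 4096):
    fit = max_concurrent_requests(
        gpu_bytes=40 * GIB,
        weight_bytes=24 * GIB,
        kv_bytes_per_token=819200,
        tokens_per_request=tokens,
    )
    print(tokens, fit)  # 1024 -> 20, 2048 -> 10, 4096 -> 5
```

Doubling the tokens held per request halves the number of requests that fit, and with it the batch size available to the scheduler.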

PagedAttention borrowed an operating-system idea. Instead of storing a request’s KV cache only in one contiguous block, it organized cache into blocks, like pages in virtual memory. A request could use non-contiguous physical blocks while the serving logic treated them as a coherent sequence. That reduced waste and made sharing or reusing cache blocks more practical.
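The paging idea can be sketched as a block table per request. The class below is a toy model of that bookkeeping, not vLLM's implementation: the class name, method names, and block size are all illustrative, and it tracks only which physical block holds which logical position, not the tensors themselves.

```python
class PagedKVCache:
    """Toy allocator: a request's logical blocks map to any free physical block."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request id -> list of physical block ids
        self.lengths = {}       # request id -> number of tokens stored

    def append_token(self, request_id):
        """Make room for one more token; allocate only when the last block is full."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1

    def locate(self, request_id, position):
        """Translate a logical token position into (physical block, offset)."""
        table = self.block_tables[request_id]
        return table[position // self.block_size], position % self.block_size

    def release(self, request_id):
        """Return a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

    def wasted_slots(self):
        """Allocated but unused token slots (fewer than block_size per request)."""
        return sum(len(table) * self.block_size - self.lengths[rid]
                   for rid, table in self.block_tables.items())


cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(5):  # two requests grow in interleaved order
    cache.append_token("a")
    cache.append_token("b")
print(cache.block_tables)      # {'a': [7, 5], 'b': [6, 4]}
print(cache.locate("a", 4))    # (5, 0)
print(cache.wasted_slots())    # 6
```

Each request ends up with non-contiguous physical blocks, yet lookups by logical position still work. Waste is bounded by the partially filled last block of each request instead of by a worst-case reservation made up front.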

This analogy is powerful because it shows the maturity of the field. Large-model serving began to look like an operating-system problem: allocate memory, avoid fragmentation, schedule work, share resources, and maintain responsiveness. vLLM reported near-zero KV cache waste and 2-4x serving-throughput improvement at the same latency compared with FasterTransformer and Orca in its evaluations. Again, the exact multiplier is paper-specific. The historical shift is that cache management became product economics.

The KV cache also explains why context windows are not free. To a user, a longer context feels like a larger memory. To the serving system, it is more tokens whose state has to fit somewhere. Long conversations, document uploads, and agent traces all increase pressure on memory. The business problem is not merely how many parameters the model has. It is how many live conversations can be served at once while their growing state remains close enough to the compute.

It also explains why serving work became inseparable from product design. If an assistant keeps every previous message, every tool result, every retrieved document, and every intermediate reasoning trace in the active context, it may feel convenient but become expensive to serve. If the product summarizes, truncates, caches, or routes context more carefully, it can change the bill without changing the visible model name. The economics of inference push upward into UX and application architecture.

Once inference became this expensive, the field developed a portfolio of cost levers. None solved the problem alone. Each traded one constraint for another.

Quantization attacked the size and arithmetic of the model. SmoothQuant framed quantization as both memory reduction and inference acceleration. Large language models contain activation outliers that make straightforward low-precision quantization difficult. SmoothQuant’s approach migrated some of that difficulty from activations to weights, enabling W8A8 quantization for large-model matrix multiplications. The paper reported up to 1.56x speedup and 2x memory reduction with negligible accuracy loss in its evaluated settings.
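The migration step can be illustrated with a per-channel scaling that leaves the matrix product unchanged. The sketch below follows the smoothing factor described in the SmoothQuant paper, s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), but it is only a pure-Python illustration on a tiny hypothetical matrix: it performs no actual INT8 quantization, the function names are illustrative, and it assumes no channel is all zeros.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def smooth(x, w, alpha=0.5):
    """Divide activation channels by s and multiply the matching weight rows by s."""
    channels = len(w)
    act_max = [max(abs(row[j]) for row in x) for j in range(channels)]
    w_max = [max(abs(v) for v in w[j]) for j in range(channels)]
    s = [act_max[j] ** alpha / w_max[j] ** (1 - alpha) for j in range(channels)]
    x_s = [[row[j] / s[j] for j in range(channels)] for row in x]
    w_s = [[v * s[j] for v in w[j]] for j in range(channels)]
    return x_s, w_s, s


# Hypothetical activations: the second input channel carries the outliers.
x = [[0.5, 40.0], [-1.0, -60.0]]
w = [[0.8, -0.4], [0.2, 0.1]]
x_s, w_s, s = smooth(x, w)

print(max(abs(v) for row in x for v in row))    # largest activation before smoothing
print(max(abs(v) for row in x_s for v in row))  # largest activation after smoothing
print(matmul(x, w))
print(matmul(x_s, w_s))  # same product up to floating-point rounding
```

The output of the layer is unchanged, but the activation range that a low-precision format has to cover shrinks, at the price of a wider weight range. The parameter `alpha` controls how much of the difficulty is moved.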

The historical point is not that one quantization method became the final answer. It is that serving made numerical precision an economic decision. Lower precision can mean less memory, more throughput, and cheaper inference, but it has to preserve enough quality for the product. The model is no longer only a mathematical artifact. It is an artifact being fitted into hardware constraints.

This is a different kind of compromise from training. During training, precision choices affect optimization stability and final model behavior. During serving, the model already exists, and the question is how cheaply it can be run without losing unacceptable quality. Post-training quantization is attractive because it can be applied after the enormous training bill has already been paid. It turns deployment into an engineering surface.

Offload attacked a different boundary. FlexGen considered high-throughput generation when GPU memory was limited. Instead of assuming the entire workload had to fit inside GPU memory, it aggregated GPU, CPU, and disk memory and used compression and scheduling to move tensors through the hierarchy. In its setup, FlexGen reported large throughput improvements for OPT-175B on a single 16 GB T4 GPU compared with offloading baselines.

FlexGen is important because it shows the distinction between latency-sensitive chat and throughput-oriented inference. Moving data through CPU memory or disk can make sense when the job values throughput over immediacy. It is less attractive for a real-time assistant where users are waiting for the next token. Serving economics is therefore workload-specific. A background summarization batch, an offline dataset labeling job, a real-time chat session, and a voice assistant do not have the same acceptable trade-off.

Speculative decoding attacked the serial loop itself. Leviathan, Kalman, and Matias described using a faster draft model to propose several tokens, then using the larger target model to verify them in a single pass. If the drafted tokens are accepted, the expensive model can advance more than one token per target-model pass while preserving the target distribution. A related speculative sampling paper from Chen and collaborators at DeepMind made the same basic bargain: spend cheap computation on proposals to reduce expensive sequential decoding.
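The claim that verification preserves the target distribution can be checked by direct calculation for a single token. The sketch below computes, exactly and without sampling, the distribution of the token that comes out of one accept-or-reject step: the draft proposes a token from q, it is accepted with probability min(1, p/q), and on rejection a replacement is drawn from the normalized positive part of p - q. The function names and the three-token vocabulary are illustrative assumptions.

```python
def acceptance_rate(p, q):
    """Probability that a token drafted from q is accepted against target p."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def speculative_output_distribution(p, q):
    """Exact distribution of the token emitted by one accept-or-reject step."""
    # q(x) * min(1, p(x) / q(x)) simplifies to min(p(x), q(x)).
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accepted)
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    norm = sum(residual)
    if norm == 0.0:  # draft and target agree exactly: nothing is ever rejected
        return accepted
    return [a + reject_mass * r / norm for a, r in zip(accepted, residual)]


p = [0.6, 0.3, 0.1]  # hypothetical target-model probabilities
q = [0.2, 0.5, 0.3]  # hypothetical draft-model probabilities
print(acceptance_rate(p, q))
print(speculative_output_distribution(p, q))  # matches p up to rounding
```

However poor the draft is, the emitted token follows the target distribution. A poor draft only lowers the acceptance rate, which is where the cost shows up.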

The appeal is obvious. Autoregressive decoding is slow because each token depends on previous tokens. If a smaller model can guess likely next tokens and the larger model can verify them in parallel, the serving system may reduce the number of expensive passes. Leviathan and collaborators reported roughly 2x-3x speedups on T5-XXL, and the DeepMind speculative sampling paper reported 2-2.5x on Chinchilla 70B, in each case within the paper’s own evaluated setting.

The caveats matter. Speculation is not free. Speedup depends on draft quality, acceptance rate, overhead, batch size, and decoding method. A draft that is often rejected wastes work. A draft that is too large or too slow can erase the benefit. The technique illustrates the serving mindset: do not only make the model faster; restructure the work so fewer expensive steps are needed.
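The dependence on acceptance rate and draft cost can be put into a small model. The sketch below uses the simplified analysis from the speculative decoding paper, which assumes each drafted token is accepted independently with probability `alpha`, that `gamma` tokens are drafted per round, and that one draft step costs a fraction `c` of one target pass. The function names and all input values are illustrative.

```python
def expected_tokens_per_target_pass(alpha, gamma):
    """Expected tokens produced per target-model pass under i.i.d. acceptance."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def expected_speedup(alpha, gamma, c):
    """Expected wall-time improvement when one draft step costs c target passes."""
    return expected_tokens_per_target_pass(alpha, gamma) / (gamma * c + 1)


for alpha, c in [(0.8, 0.05), (0.3, 0.05), (0.3, 0.3)]:
    print(alpha, c, round(expected_speedup(alpha, gamma=4, c=c), 3))
```

With a well-matched, cheap draft the model predicts a speedup close to 3x. With a low acceptance rate the gain nearly vanishes, and with a draft that is both inaccurate and slow the result drops below 1, meaning speculation makes decoding slower.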

Together, quantization, offload, and speculation show why there was no single “inference fix.” One technique changes numerical representation. Another changes where tensors live. Another changes how many target-model passes are needed. Each can help under the right workload and disappoint under the wrong one. The systems problem is choosing a mix that matches the product’s tolerance for latency, quality loss, engineering complexity, and hardware cost.

By 2024, DistServe gave the chapter its cleanest serving axis: prefill and decode are different problems.

Prefill is the work of processing the prompt and getting to the first generated token. Decode is the repeated work of producing output tokens after that. DistServe framed user experience around time to first token and time per output token. These are not the same latency. A user cares when the answer starts, and then how smoothly it continues. A system that optimizes one can hurt the other.
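Both measures fall out of the timestamps at which tokens reach the user. The sketch below is one straightforward way to compute them, assuming timestamps in integer milliseconds. The function name and the sample trace are hypothetical.

```python
def ttft_and_tpot(arrival_ms, token_times_ms):
    """Return (time to first token, mean time per output token after the first)."""
    ttft = token_times_ms[0] - arrival_ms
    if len(token_times_ms) < 2:
        return ttft, 0.0
    span = token_times_ms[-1] - token_times_ms[0]
    return ttft, span / (len(token_times_ms) - 1)


# A request arrives at t=0; the first token appears after 400 ms of prefill
# and queueing, then one token streams every 50 ms.
print(ttft_and_tpot(0, [400, 450, 500, 550, 600]))  # (400, 50.0)
```

Total latency for this request is 600 ms, but the two components are produced by different phases and respond to different optimizations.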

Zhong and collaborators argued that colocating prefill and decoding on the same GPUs can create interference and resource coupling. Prefill often has different computational characteristics from decode. The paper described prefill as compute-bound in important settings, while decoding processes one token at a time but still faces significant memory and IO pressure. If the two phases are batched together too bluntly, one phase can disrupt the other.

The distinction also makes “fast” less vague. A system can have a low time to first token and then generate slowly. It can start slowly and then stream quickly. It can satisfy average latency while violating the tail latency that users notice during spikes. DistServe’s goodput framing matters because throughput alone can be misleading: a request completed after its SLO is not equally useful to the product. The serving system is paid, in effect, for timely work.
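The gap between throughput and goodput is easy to show with a handful of requests. The sketch below uses a simplified definition, counting only requests that meet both latency targets. DistServe's own definition is stricter in form (the highest request rate sustainable while a target fraction of requests meets the SLO), so this is an illustration of the idea, not the paper's metric. All names and numbers are hypothetical.

```python
def throughput_and_goodput(requests, ttft_slo_ms, tpot_slo_ms, window_s):
    """Return (completed requests per second, SLO-compliant requests per second).

    requests: list of (ttft_ms, tpot_ms) pairs observed during the window.
    """
    completed = len(requests)
    on_time = sum(1 for ttft, tpot in requests
                  if ttft <= ttft_slo_ms and tpot <= tpot_slo_ms)
    return completed / window_s, on_time / window_s


observed = [(300, 40), (350, 45), (900, 40), (320, 120)]
print(throughput_and_goodput(observed, ttft_slo_ms=500, tpot_slo_ms=50,
                             window_s=2))  # (2.0, 1.0)
```

All four requests finished, so throughput is 2 requests per second. One started too late and one streamed too slowly, so only half of that work counts toward goodput.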

DistServe separated the phases onto different GPUs and optimized for goodput under service-level objectives. Its reported results were large: up to 7.4x more requests or 12.6x tighter SLOs than state-of-the-art baselines in its evaluation. The exact numbers belong to the paper’s setup. The enduring idea is that inference serving became phase-aware. A product system could no longer treat a request as one undifferentiated blob of computation.

The prefill/decode split also explains why product requirements complicate hardware planning. A long prompt with a short answer stresses the system differently from a short prompt with a long answer. A retrieval-augmented workflow that stuffs many documents into context may increase prefill work. A chatty assistant producing long responses may emphasize decode. A voice product may make both phases more latency-sensitive because silence feels awkward. The serving architecture has to match the workload.

This is where Chapter 62’s multimodal interface turns into infrastructure pressure. Images, documents, audio, and tool traces do not magically disappear when the assistant begins to answer. They have to be encoded, scheduled, routed, cached, or summarized. The details differ by modality and system, but the product expectation is clear: the assistant should respond as if mixed media were natural. Inference economics is what decides whether that expectation can be met at scale.

It also explains why averages are not enough. A product can look efficient on aggregate and still feel broken to the users who land in the slow tail. Serving systems therefore care about distributions: bursty arrivals, unusually long prompts, long generations, and cache-heavy sessions. The economic goal is not simply maximum throughput in a quiet lab run. It is useful throughput while the service remains responsive under messy demand.

This is the deeper meaning of inference economics. Cost is not just dollars per token. It is the shape of the computation. It is batch size, memory bandwidth, cache fragmentation, precision, offload distance, draft acceptance, first-token latency, token cadence, and SLO compliance. Each product promise maps onto a serving constraint.

The old story of AI infrastructure centered on training clusters: how to gather enough accelerators, feed them enough data, and keep them synchronized long enough to produce a frontier model. That story remains important. But the product era added a second infrastructure story. Once millions of people use the model, the lab becomes a serving operator. It needs schedulers, memory managers, quantizers, routers, caches, fallback paths, monitoring, and capacity plans.

The invisible serving layer also affects reliability. A model can fail because its answer is wrong, but a product can fail because it starts too slowly, stalls mid-answer, drops context, overloads a cache, or routes a request to the wrong tier. These are not philosophical alignment failures. They are ordinary systems failures made expensive by the cost of each token.

This changed what counted as model quality. A model that is slightly smarter but much slower may be worse for many products. A cheaper model that answers within the latency budget may be more valuable for routine work. A system that uses a strong model only when needed and a cheaper path otherwise can be more economically viable than a monolithic frontier model answering every request. The serving layer becomes part of the product’s intelligence.

Routing follows from the same logic. Some prompts need a frontier model. Some need a smaller model, a cached answer, a retrieval step, or a refusal. Some can wait. Some are interactive and cannot. The serving stack increasingly has to decide not only how to run a model, but which path should handle a request. That decision is economic, technical, and product-facing at once.

This is why inference became a discipline of margins. A small reduction in wasted KV cache, a better batch scheduler, a lower-precision path that preserves quality, or a speculative decoder that works for common prompts can change the cost curve when multiplied across millions of requests. Conversely, a feature that doubles context length or adds a tool loop can erase those gains. The user sees capability; the operator sees compounding resource claims.

Inference economics also sets up the next constraints. If serving every request consumes scarce memory and accelerator time, edge deployment becomes attractive but difficult. If low latency matters, geography and device placement matter. If tokens become an industrial workload, power and datacenter planning become central. The product era did not end the training race. It added a second race: making intelligence cheap enough, fast enough, and available enough to be used all day.

The user sees an answer appear in a chat window. Underneath, a scheduling system is making thousands of small economic decisions. Which requests share a batch? Which tokens occupy cache? Which precision is acceptable? Which phase gets which GPU? Which work can be drafted, compressed, offloaded, or delayed? Inference economics is the name for that hidden discipline.