Running Large Language Models at Scale


I was watching a talk by Dylan Patel of SemiAnalysis in which he covered the key techniques AI labs use to run inference at scale. I’ve shared my notes below and highly recommend watching the full presentation.

There are two distinct phases in inference: prefill and decode. Prefill processes the initial prompt and is compute-intensive, while decode generates tokens iteratively and is memory bandwidth-intensive.
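To make the distinction concrete, here is a minimal NumPy sketch of a single attention layer with made-up dimensions and random weights (purely illustrative, not any real model): prefill runs one large matmul over the whole prompt and builds the KV cache, while decode advances one token at a time, re-reading the weights and the growing cache at every step.

```python
import numpy as np

# Toy single-layer attention to contrast prefill and decode.
# All dimensions and weights are made up for illustration.
d_model, prompt_len, new_tokens = 64, 128, 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))

def attend(q, K, V):
    # Scaled dot-product attention over everything cached so far.
    scores = q @ K.T / np.sqrt(d_model)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Prefill: one big matmul over the entire prompt (compute-bound),
# which also produces the KV cache in a single pass.
prompt = rng.standard_normal((prompt_len, d_model))
K_cache, V_cache = prompt @ Wk, prompt @ Wv
hidden = attend(prompt @ Wq, K_cache, V_cache)

# Decode: one token per step (bandwidth-bound); each step does a tiny
# matmul but must re-read the weights and the ever-growing KV cache.
x = hidden[-1:]
for _ in range(new_tokens):
    K_cache = np.vstack([K_cache, x @ Wk])
    V_cache = np.vstack([V_cache, x @ Wv])
    x = attend(x @ Wq, K_cache, V_cache)
```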

Several key techniques have emerged for efficient inference:

  • Continuous batching is essential for cost-effective operation. Rather than processing requests one at a time, continuous batching lets many user requests share the same forward passes, dramatically reducing cost compared to batch size one. This is particularly important when handling asynchronous requests that arrive at different times (a toy scheduling sketch follows this list).
  • Disaggregated prefill separates the compute-intensive prefill phase from the bandwidth-intensive decode phase. Major providers implement this by using different accelerators for each phase, helping mitigate “noisy neighbor” problems and maintaining consistent performance under varying loads.
  • Context caching is an emerging optimization that avoids recomputing the key-value (KV) cache for frequently reused prompts. While it requires significant CPU memory or storage, it can substantially reduce costs for applications that repeatedly reference the same context, such as legal document analysis. Google was the first to offer this; OpenAI and Anthropic now provide it as well (a prefix-caching sketch also follows this list).
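As a rough illustration of continuous batching, here is a toy scheduler sketch. Everything in it is hypothetical: `MAX_BATCH`, the `waiting`/`running` structures, and the per-step bookkeeping stand in for a real engine’s batched decode step, which is simulated here by simply counting down remaining tokens.

```python
import collections

# Hypothetical continuous-batching scheduler: new requests join the running
# batch as soon as a slot frees up, instead of waiting for a whole batch to
# drain. The decode step is simulated by counting down remaining tokens.
MAX_BATCH = 8
waiting = collections.deque()   # arrived but not yet started
running = {}                    # request_id -> tokens still to generate

def submit(request_id, tokens_to_generate):
    waiting.append((request_id, tokens_to_generate))

def decode_step():
    # Pull waiting requests into free slots: this is the "continuous" part.
    while waiting and len(running) < MAX_BATCH:
        rid, remaining = waiting.popleft()
        running[rid] = remaining
    # One batched decode step advances every active sequence by one token.
    finished = []
    for rid in list(running):
        running[rid] -= 1
        if running[rid] == 0:
            finished.append(rid)
            del running[rid]    # freed slot is refilled on the next step
    return finished
```

Because requests start decoding on the next step that has a free slot, the accelerator stays saturated even when traffic arrives unevenly.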
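And here is a minimal sketch of the context-caching idea, assuming the KV cache for a shared prompt prefix can be stored and reattached to later requests. The hash-keyed store, `compute_kv`, and the cache contents are placeholders, not any provider’s actual API.

```python
import hashlib

kv_store = {}  # prefix hash -> precomputed KV cache (placeholder structure)

def compute_kv(prefix_tokens):
    # Stand-in for the expensive prefill pass over the shared prefix.
    return {"keys": len(prefix_tokens), "values": len(prefix_tokens)}

def get_kv(prefix_tokens):
    key = hashlib.sha256(" ".join(prefix_tokens).encode()).hexdigest()
    if key not in kv_store:
        kv_store[key] = compute_kv(prefix_tokens)  # pay the prefill cost once
    return kv_store[key]                           # later requests reuse it

# Many requests against the same long document prefill it only once.
shared_doc = ["<a", "long", "legal", "document>"]
for question in ["What is clause 4?", "Who are the parties?"]:
    kv = get_kv(shared_doc)  # cache hit after the first request
```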

A critical challenge in large-scale deployments is managing “straggler” GPUs. As ByteDance discovered, even in high-end GPU clusters individual chips can underperform due to the “silicon lottery”, the natural variation in chip quality. In their case, a single underperforming GPU dragged down the whole cluster, because training workloads are synchronous and therefore limited by the slowest component.

For organizations building inference infrastructure, managing memory bandwidth becomes a critical challenge. Running large models requires loading an enormous number of parameters for every token generated, making memory bandwidth the key constraint on hitting tokens-per-second targets while serving many users concurrently.
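A back-of-the-envelope calculation shows why. Assuming, purely for illustration, a 70B-parameter model in FP16 and roughly 3.3 TB/s of HBM bandwidth (H100-class numbers), decode at batch size one is capped at a few dozen tokens per second, since every token must re-read all the weights; batching amortizes those reads across users.

```python
# Illustrative numbers only: 70B parameters, FP16 weights, ~3.3 TB/s HBM.
params = 70e9
bytes_per_param = 2                       # FP16
weight_bytes = params * bytes_per_param   # ~140 GB read per decoded token
hbm_bandwidth = 3.3e12                    # bytes per second

tokens_per_sec_single = hbm_bandwidth / weight_bytes
print(f"batch 1: ~{tokens_per_sec_single:.0f} tokens/s for one user")

# The same weight reads can serve a whole batch, so aggregate throughput
# scales with batch size (until compute or KV-cache memory gets in the way).
batch = 32
print(f"batch {batch}: ~{batch * tokens_per_sec_single:.0f} aggregate tokens/s")
```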

The infrastructure challenges compound when scaling to larger models, requiring careful consideration of hardware capabilities, batching strategies, and caching mechanisms to maintain performance and cost-effectiveness.

