Talk: AI Engineering at Jane Street

I watched the AI Engineering at Jane Street talk yesterday. It offers many practical insights into how a mature engineering organization like Jane Street uses LLMs in its development process. My notes from the talk are below:

  • Jane Street decided to train their own model. They had to because most off-the-shelf large language models are not proficient with OCaml due to the limited amount of OCaml training data available, which makes it hard to find existing tools that work effectively with their codebase.
  • They took inspiration from a Meta paper, AI-assisted Code Authoring at Scale, which detailed the results of fine-tuning a model for Hack, a language similar to OCaml in that it is used primarily at one company and not widely adopted elsewhere. The paper convinced Jane Street that training their own model was worth pursuing, so they set out to replicate the results for OCaml. They soon realized, however, that good outcomes would require more than taking an off-the-shelf model and showing it their code; they needed to collect many training examples that matched their specific goals.
  • Jane Street collects training data through a method called workspace snapshotting, where they take snapshots of developer workstations throughout the day. This lets them capture in-progress changes along with build statuses, which can then be used as training data (a minimal sketch of the idea follows this list). It is an interesting, if expensive and complex, way to create a dataset. They had to do it because pull request and commit data did not work for this purpose. Some of the challenges they mentioned in the talk:
    • The feature descriptions used in their internal code review system (Iron) are not written in the same way as prompts that a developer would use in an editor.
    • Features can be very large (e.g., 500 to 1000 lines), which complicates their use as training data. Smaller, isolated changes are needed for effective training.
    • Commits are not isolated changes and lack descriptions, making them less useful for training purposes.
  • They aligned the model’s outputs with human standards of good code through reinforcement learning. This involves verifying that generated code can be parsed, type-checks, and passes tests when applied to the codebase (see the reward-function sketch after this list).
  • They also had to build their own code editor integrations. A sidecar proxy application manages context, constructs prompts, and monitors build status on behalf of the editors (a rough sketch of that architecture also follows the list). This lets them ship updates without modifying each editor individually. They also collect editor metrics to measure the latency and effectiveness of the diffs the model generates.
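
To make the workspace snapshotting idea concrete, here is a minimal sketch of how periodic workspace diffs plus build status could be captured as training examples. This is not Jane Street's actual pipeline; the snapshot interval, the `dune build` command, and the JSONL output path are all my assumptions.

```python
import json
import subprocess
import time

SNAPSHOT_INTERVAL_S = 60 * 20              # assumption: snapshot every 20 minutes
OUTPUT_FILE = "workspace_snapshots.jsonl"  # hypothetical output path


def capture_snapshot(workspace: str) -> dict:
    """Capture the current uncommitted diff and build status for one workspace."""
    diff = subprocess.run(
        ["git", "-C", workspace, "diff"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Hypothetical build command; Jane Street uses their own build system.
    build = subprocess.run(
        ["dune", "build"], cwd=workspace, capture_output=True, text=True,
    )
    return {
        "timestamp": time.time(),
        "diff": diff,
        "build_ok": build.returncode == 0,
        "build_errors": build.stderr[-4000:],  # keep only the tail of the error log
    }


def snapshot_loop(workspace: str) -> None:
    """Append a snapshot of the workspace to a JSONL dataset on a fixed interval."""
    while True:
        with open(OUTPUT_FILE, "a") as f:
            f.write(json.dumps(capture_snapshot(workspace)) + "\n")
        time.sleep(SNAPSHOT_INTERVAL_S)
```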
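
The reinforcement learning step leans on mechanically checkable signals: does the generated code apply, type-check, and pass tests? A minimal sketch of such a reward function is below. The reward values and the `git apply` / `dune` commands are my assumptions, not the actual reward they described.

```python
import shutil
import subprocess
import tempfile


def verification_reward(repo: str, patch: str) -> float:
    """Score a generated patch by whether it applies, builds (type-checks), and passes tests."""
    workdir = tempfile.mkdtemp()
    shutil.copytree(repo, workdir, dirs_exist_ok=True)

    # 1. Does the patch even apply to the codebase?
    apply = subprocess.run(["git", "-C", workdir, "apply"],
                           input=patch, text=True, capture_output=True)
    if apply.returncode != 0:
        return 0.0

    # 2. Does the patched code parse and type-check? (dune build is an assumption)
    build = subprocess.run(["dune", "build"], cwd=workdir, capture_output=True)
    if build.returncode != 0:
        return 0.25

    # 3. Do the tests pass?
    tests = subprocess.run(["dune", "test"], cwd=workdir, capture_output=True)
    return 1.0 if tests.returncode == 0 else 0.5
```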
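
The sidecar can be thought of as a small local service sitting between the editors and the model. The sketch below is only my guess at the shape of such a component; the `model_client` interface, the context-gathering logic, and the build-status check are all hypothetical.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class EditorRequest:
    file_path: str
    cursor_line: int
    instruction: str  # what the developer asked for in the editor


class Sidecar:
    """Sketch of an editor sidecar: builds prompts, calls the model, tracks build status."""

    def __init__(self, workspace: str, model_client):
        self.workspace = workspace
        self.model_client = model_client  # assumed to expose .complete(prompt) -> str

    def gather_context(self, req: EditorRequest) -> str:
        # Naive context: just the file being edited. A real sidecar would pull in
        # related modules, interfaces, and recent build errors.
        with open(req.file_path) as f:
            return f.read()

    def build_prompt(self, req: EditorRequest) -> str:
        return (f"Instruction: {req.instruction}\n"
                f"Cursor line: {req.cursor_line}\n"
                f"File contents:\n{self.gather_context(req)}")

    def build_status(self) -> bool:
        # Hypothetical: shell out to the build system to see if the tree compiles.
        return subprocess.run(["dune", "build"], cwd=self.workspace).returncode == 0

    def suggest_diff(self, req: EditorRequest) -> str:
        return self.model_client.complete(self.build_prompt(req))
```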

You can watch the video on Videocrawl – https://www.videocrawl.dev/studio?url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3D0ML7ZLMdcl4. Videocrawl is an AI companion application that improves the learning/watching experience.

Creating Data Annotation Apps For Doing Error Analysis of LLM Apps

This is a video I recorded today covering how I use Claude to build data annotation apps, which I use for doing error analysis on my LLM apps. You can also view the video on Videocrawl – https://www.videocrawl.dev/studio?url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Dz0Ktg0vLTBQ

Running Large Language Models at Scale

I watched a talk by Dylan Patel from SemiAnalysis covering important techniques AI labs use to run inference at scale. I’ve shared my notes below and highly recommend checking out the full presentation.

There are two distinct phases in inference: prefill and decode. Prefill processes the initial prompt and is compute-intensive, while decode generates tokens iteratively and is memory bandwidth-intensive.
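
A toy sketch of the two phases, ignoring all real model details, is below: prefill runs the whole prompt through the model once to build the KV cache, while decode repeatedly runs a single token against that cache. The `model` interface here is hypothetical.

```python
def generate(model, prompt_tokens, max_new_tokens):
    """Toy two-phase generation loop: one compute-heavy prefill, then bandwidth-heavy decode."""
    # `model` is a hypothetical interface exposing prefill() and decode_step().

    # Prefill: process the entire prompt in parallel and build the KV cache.
    # Compute-bound, since every prompt token goes through the model at once.
    kv_cache, next_token = model.prefill(prompt_tokens)

    # Decode: generate one token per step, reusing and extending the KV cache.
    # Each step must stream the model weights from memory, so it is bandwidth-bound.
    output = [next_token]
    for _ in range(max_new_tokens - 1):
        kv_cache, next_token = model.decode_step(next_token, kv_cache)
        if next_token == model.eos_token:
            break
        output.append(next_token)
    return output
```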

Several key techniques have emerged for efficient inference:

  • Continuous batching is essential for cost-effective operations. Rather than processing requests one at a time, continuous batching lets multiple user requests be processed together, dramatically reducing costs compared to batch size one. This is particularly important when handling asynchronous user requests that arrive at different times (a toy scheduler sketch follows this list).
  • Disaggregated prefill separates the compute-intensive prefill phase from the bandwidth-intensive decode phase. Major providers implement this by using different accelerators for each phase, which helps mitigate “noisy neighbor” problems and maintain consistent performance under varying loads (see the routing sketch below).
  • Context caching is an emerging optimization that avoids recomputing the key-value (KV) cache for frequently used prompts. While it requires significant CPU memory or storage, it can substantially reduce costs for applications that repeatedly reference the same context, such as legal document analysis. Google shipped this technique first; OpenAI and Anthropic now offer it as well (a minimal cache sketch follows the list).
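
Here is the toy continuous-batching scheduler I mentioned above, using a hypothetical `engine` that can run one decode step for a whole batch and request objects with `prompt`, `state`, and `done` fields. Real servers are far more sophisticated; this only shows requests joining and leaving the batch between steps.

```python
from collections import deque


def serve(engine, incoming: deque, max_batch_size: int = 32):
    """Toy continuous batching loop: requests join and leave the batch between decode steps."""
    active = []  # requests currently being decoded together
    while incoming or active:
        # Admit newly arrived requests into the running batch; no waiting for it to drain.
        while incoming and len(active) < max_batch_size:
            req = incoming.popleft()
            req.state = engine.prefill(req.prompt)  # hypothetical per-request prefill
            active.append(req)

        # One decode step advances every active request by a single token.
        engine.decode_step(active)

        # Finished requests leave immediately, freeing slots for new arrivals.
        active = [r for r in active if not r.done]
```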
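
Disaggregated prefill is, at its core, a routing decision: prefill work goes to one pool of accelerators, decode work to another, and the KV cache is handed over in between. A hedged sketch with entirely hypothetical `prefill_pool` and `decode_pool` objects:

```python
def handle_request(prompt_tokens, prefill_pool, decode_pool, max_new_tokens=256):
    """Sketch of disaggregated serving: separate accelerator pools for prefill and decode."""
    # Phase 1: compute-bound prefill on hardware sized for throughput.
    kv_cache, first_token = prefill_pool.run_prefill(prompt_tokens)

    # Hand the KV cache over to the decode pool. In practice this transfer is the hard
    # part: it has to move over a fast interconnect without stalling either pool.
    handle = decode_pool.receive_kv_cache(kv_cache)

    # Phase 2: bandwidth-bound decode on hardware sized for memory bandwidth.
    return decode_pool.run_decode(handle, first_token, max_new_tokens)
```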
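
And context caching can be sketched as a lookup keyed on a prompt prefix: if the KV cache for that prefix was computed before, it is fetched from cheaper memory or storage instead of recomputed. The in-memory dict and `compute_kv_cache` function below are stand-ins, not how any provider actually stores caches.

```python
import hashlib


class ContextCache:
    """Toy KV-cache store keyed by a hash of the shared prompt prefix."""

    def __init__(self, compute_kv_cache):
        self._store = {}                           # in practice: CPU RAM or SSD, not a dict
        self._compute_kv_cache = compute_kv_cache  # hypothetical: prefix tokens -> KV cache

    def get(self, prefix_tokens: list[int]):
        key = hashlib.sha256(str(prefix_tokens).encode("utf-8")).hexdigest()
        if key not in self._store:
            # Cache miss: pay the full prefill cost once for this shared prefix.
            self._store[key] = self._compute_kv_cache(prefix_tokens)
        # Cache hit on later requests: skip recomputing the prefix prefill entirely.
        return self._store[key]
```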

A critical challenge in large-scale deployments is managing “straggler” GPUs. As ByteDance discovered, even in high-end GPU clusters, individual chips can underperform due to the “Silicon Lottery” – natural variations in chip performance. In their case, a single underperforming GPU reduced cluster performance significantly, as training workloads are synchronous and are limited by the slowest component.

For organizations building inference infrastructure, managing memory bandwidth becomes a critical challenge. Running large models requires loading enormous numbers of parameters for each token generation, making memory bandwidth a key constraint for achieving desired tokens-per-second targets while serving multiple users.
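
A back-of-the-envelope calculation shows why bandwidth sets the ceiling for decode: if every generated token has to stream all the weights from HBM, tokens per second per request is roughly bandwidth divided by model size. The numbers below are illustrative assumptions, not figures from the talk.

```python
# Rough decode-speed ceiling: every token must stream the model weights from memory.
# Illustrative numbers, not from the talk.
params = 70e9             # 70B-parameter model (assumed)
bytes_per_param = 2       # fp16/bf16 weights
hbm_bandwidth = 3.35e12   # ~3.35 TB/s, roughly an H100 SXM (assumed)

weight_bytes = params * bytes_per_param            # ~140 GB read per token
tokens_per_sec_ceiling = hbm_bandwidth / weight_bytes

print(f"Upper bound: ~{tokens_per_sec_ceiling:.0f} tokens/s for a single request")
# Batching many users amortizes this weight traffic, which is why large batch sizes
# (and enough memory to hold their KV caches) are central to serving economics.
```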

The infrastructure challenges compound when scaling to larger models, requiring careful consideration of hardware capabilities, batching strategies, and caching mechanisms to maintain performance and cost-effectiveness.