The Importance of Starting with Error Analysis in LLM Applications

I was watching a video (link below) posted by Hamel on his YouTube channel. In the video, Hamel recommended that before we write any automated evals for our LLM application, we should spend a good amount of time looking at the data. He said, “Keep looking at the data until you are not learning anything new from it.”

The time and effort you spend doing manual error analysis will help you identify the areas you should focus on. You will learn how users use your application, what kinds of queries they fire (short, long, keyword-based, etc.), whether you are retrieving the right context, whether the response is generated per your expectations (instructions), and so on.

You start by reviewing individual user interactions, categorizing errors in a straightforward manner (e.g., using a spreadsheet or a low-code UI), and prioritizing fixes based on real user data. By focusing on recurring issues observed across interactions, you can address the most significant pain points before creating formal evaluation metrics.
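The "categorize, then prioritize by frequency" step above can be sketched in a few lines. This is a minimal illustration, not Hamel's actual workflow; the trace records and error labels here are hypothetical stand-ins for whatever categories emerge from your own review.

```python
from collections import Counter

# Hypothetical error labels assigned during manual review of individual
# user interactions (the "spreadsheet" step described above).
reviewed_traces = [
    {"query": "refund policy?", "error": "retrieval_miss"},
    {"query": "summarize my last order", "error": "ignored_instructions"},
    {"query": "refund for order #123", "error": "retrieval_miss"},
    {"query": "hi", "error": None},  # no error observed
    {"query": "cancel subscription", "error": "retrieval_miss"},
]

# Tally recurring issues so fixes can be prioritized by frequency.
error_counts = Counter(
    t["error"] for t in reviewed_traces if t["error"] is not None
)

for label, count in error_counts.most_common():
    print(label, count)
```

The most frequent category (here, retrieval misses) is the first candidate for a fix, and later for a formal eval metric.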

The process involves iterating on insights gained from user data, leveraging synthetic data to simulate scenarios, and ensuring tests are motivated by real-world errors rather than arbitrary assumptions. This pragmatic methodology ensures evaluations are meaningful, guiding improvements in coherence, fluency, and relevance while fostering effective development practices.

You can watch the video here

Why LLM Vocabulary Size Matters

Large Language Models (LLMs) are at the forefront of AI advancements, capable of understanding and generating human-like text. Central to their operation is the concept of vocabulary—a predefined set of words or tokens that the model uses to interpret and generate text. But why does vocabulary size matter? This blog delves into the intricacies of LLM vocabulary, its impact on model performance, and how different approaches influence the effectiveness of these models.

What is LLM Vocabulary?

An LLM vocabulary is the set of tokens—ranging from single characters to whole words or phrases—that a model uses to represent input and output text. Tokenization, the process of converting text into these tokens, is a critical step that influences how efficiently a model can process and understand data.

For example, the word “tokenization” might be split into smaller subwords like “token” and “ization,” or it could be represented as a single token, depending on the model’s vocabulary size and tokenization strategy. This has profound implications for the model’s performance, resource requirements, and capabilities.
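To make the example concrete, here is a toy greedy longest-match subword tokenizer. It is a simplified stand-in for BPE/WordPiece-style schemes (not any specific model's algorithm), and the two vocabularies are invented purely to show how vocabulary size changes the split.

```python
def greedy_tokenize(word, vocab):
    """Split a word into subword tokens by greedy longest match
    against a vocabulary, falling back to single characters."""
    tokens = []
    i = 0
    while i < len(word):
        # Find the longest vocabulary entry starting at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No match: emit a single character and move on.
            tokens.append(word[i])
            i += 1
    return tokens

small_vocab = {"token", "ization", "ize"}     # whole word absent
large_vocab = small_vocab | {"tokenization"}  # whole word present

print(greedy_tokenize("tokenization", small_vocab))  # ['token', 'ization']
print(greedy_tokenize("tokenization", large_vocab))  # ['tokenization']
```

A larger vocabulary lets the model spend one token where a smaller one spends two, trading embedding-table size for shorter sequences.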

Continue reading “Why LLM Vocabulary Size Matters”

Giving OpenAI Predicted Output a Try

OpenAI recently released a new feature in their API called Predicted Outputs, designed to reduce the latency of model responses in scenarios where much of the response is already known in advance. A good use case for this is when making incremental changes to code or text. I am currently working on an LLM use case for a customer, building a JSON document generator for their low-code tool. The user interacts through a chat interface to incrementally build the JSON document. I decided to give Predicted Outputs a try to see how it performs with this use case.
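For the incremental-JSON-editing case described above, the existing document is passed to the API via the `prediction` parameter of the Chat Completions endpoint. The sketch below only builds the request arguments (no API call is made); the model name and JSON document are placeholder assumptions, not details from the post.

```python
def build_request(messages, current_json, model="gpt-4o-mini"):
    """Build kwargs for chat.completions.create using Predicted Outputs.

    The `prediction` parameter tells the API that most of the response
    (here, the current JSON document being edited incrementally) is
    expected to reappear unchanged in the output, which can reduce latency.
    """
    return {
        "model": model,
        "messages": messages,
        "prediction": {"type": "content", "content": current_json},
    }

# The document so far; the user requests one incremental change.
current_json = '{"title": "Order Form", "fields": [{"name": "email"}]}'
request = build_request(
    [{"role": "user",
      "content": "Add a phone field to this form:\n" + current_json}],
    current_json,
)
# With an OpenAI client instance this would be sent as:
#   client.chat.completions.create(**request)
```

Since most of the generated JSON matches the prediction, the model can validate those spans instead of generating them token by token.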

Continue reading “Giving OpenAI Predicted Output a Try”