I was watching a video (link below) posted by Hamel on his YouTube channel. In the video, Hamel recommends that before we write any automated evals for our LLM application, we should spend a good amount of time looking at the data. As he puts it, “Keep looking at the data until you are not learning anything new from it.”
The time and effort you spend on manual error analysis will help you identify the areas you should focus on. You will learn how users actually use your application, what kinds of queries they fire (short, long, keyword-only, etc.), whether you are retrieving the right context, whether your responses follow your instructions, and so on.
You start by reviewing individual user interactions, categorizing errors in a straightforward way (e.g., a spreadsheet or a low-code UI), and prioritizing fixes based on real user data. By focusing on issues that recur across interactions, you can address the most significant pain points before creating formal evaluation metrics.
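To make this concrete, here is a minimal sketch (not from the video) of how you might tally hand-labeled error categories from a spreadsheet-style log of interactions. The file name, column names, and categories are assumptions for illustration; the point is simply that the most frequent categories tell you where to focus first.

```python
import csv
from collections import Counter

# Hypothetical log of user interactions exported from your application.
# Assumed columns: interaction_id, user_query, retrieved_context,
# model_response, error_category (filled in by hand while reviewing each row).
LOG_FILE = "interaction_log.csv"

def tally_error_categories(path: str) -> Counter:
    """Count how often each manually assigned error category appears."""
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            category = row.get("error_category", "").strip()
            if category:  # rows reviewed and found fine are left blank
                counts[category] += 1
    return counts

if __name__ == "__main__":
    # Most frequent categories first: these are the pain points to address
    # before investing in formal, automated evaluation metrics.
    for category, count in tally_error_categories(LOG_FILE).most_common():
        print(f"{count:4d}  {category}")
```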
From there, you iterate on the insights gained from user data, use synthetic data to simulate scenarios you have not yet seen, and make sure every test is motivated by a real-world error rather than an arbitrary assumption. This keeps your evaluations meaningful: they guide improvements in coherence, fluency, and relevance instead of chasing metrics for their own sake.
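As one illustration of turning observed failures into synthetic test scenarios, here is a small sketch. The failure categories, example queries, and personas are invented placeholders, not from the video; in practice you would derive them from your own error analysis so that every generated case traces back to a real failure.

```python
import itertools
import json

# Hypothetical error categories observed during manual review, paired with
# query templates that reproduce the pattern (assumptions for illustration).
FAILURE_PATTERNS = {
    "keyword_only_query": ["refund policy", "reset password", "pricing enterprise"],
    "multi_part_question": [
        "How do I export my data, and does that cancel my subscription?",
        "Can I change my plan mid-cycle and will I get a prorated refund?",
    ],
}

PERSONAS = ["new user", "frustrated user", "power user"]

def build_synthetic_cases():
    """Pair each failure-inspired query with a persona, tagging every case
    with the error it is meant to probe so tests stay grounded in real data."""
    cases = []
    for (category, queries), persona in itertools.product(
        FAILURE_PATTERNS.items(), PERSONAS
    ):
        for query in queries:
            cases.append({
                "query": query,
                "persona": persona,
                "targets_error": category,  # links the test back to observed failures
            })
    return cases

if __name__ == "__main__":
    print(json.dumps(build_synthetic_cases(), indent=2))
```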
You can watch the video here.
