The importance of starting with Error Analysis in LLM Applications

I was watching a video (link below) posted by Hamel on his YouTube channel. In the video, Hamel recommended that before we write any automated evals for our LLM application, we should spend a good amount of time looking at the data. He said, “Keep looking at the data until you are not learning anything new from it”.

The time and effort you spend on manual error analysis will help you identify the areas you should focus on. You will learn how users use your application, what kinds of queries they fire (short, long, keyword-only, etc.), whether you are retrieving the right context, whether responses are generated as per your expectations (instructions), and so on.

You start by reviewing individual user interactions, categorizing errors in a straightforward manner (e.g., in a spreadsheet or a low-code UI), and prioritizing fixes based on real user data. By focusing on recurring issues observed across interactions, you can address the most significant pain points before creating formal evaluation metrics.
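
As a concrete illustration, here is a minimal sketch of that kind of categorization done in code rather than in a spreadsheet UI; the trace fields and failure-mode labels are hypothetical and will differ for your application:

import csv

# Hypothetical traces pulled from your application's logs; the field names here
# are illustrative, not tied to any particular logging framework.
traces = [
    {"query": "What is your refund policy?", "response": "..."},
    {"query": "cancel order #123", "response": "..."},
]

# Failure modes you converge on while reading traces; start small and grow the list.
FAILURE_MODES = ["retrieval_miss", "ignored_instructions", "hallucination", "formatting", "none"]

with open("error_analysis.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["query", "response", "failure_mode", "notes"])
    writer.writeheader()
    for trace in traces:
        writer.writerow({
            "query": trace["query"],
            "response": trace["response"],
            "failure_mode": "",  # filled in by hand while reviewing, using one of FAILURE_MODES
            "notes": "",
        })

Counting how often each failure mode appears in that sheet tells you which fixes, and later which automated evals, to prioritize.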

The process involves iterating on the insights gained from user data, leveraging synthetic data to simulate scenarios, and ensuring tests are motivated by real-world errors rather than arbitrary assumptions. This pragmatic methodology keeps evaluations meaningful, so improvements in coherence, fluency, and relevance are grounded in what users actually experience.

You can watch the video here

Why LLM Vocabulary Size Matters

Large Language Models (LLMs) are at the forefront of AI advancements, capable of understanding and generating human-like text. Central to their operation is the concept of vocabulary—a predefined set of words or tokens that the model uses to interpret and generate text. But why does vocabulary size matter? This post delves into the intricacies of LLM vocabulary, its impact on model performance, and how different approaches influence the effectiveness of these models.

What is LLM Vocabulary?

An LLM vocabulary is the set of tokens—ranging from characters to phrases—that a model uses to represent input and output text, enabling it to process and understand language efficiently. Tokenization, the process of converting text into tokens, is a critical step that influences how efficiently a model can process and understand data.

For example, the word “tokenization” might be split into smaller subwords like “token” and “ization,” or it could be represented as a single token, depending on the model’s vocabulary size and tokenization strategy. This has profound implications for the model’s performance, resource requirements, and capabilities.
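
To see this in practice, here is a small sketch using OpenAI's tiktoken library; the cl100k_base encoding is just one example vocabulary, and models with different vocabularies will split the same words differently:

import tiktoken  # pip install tiktoken

# cl100k_base is the vocabulary used by several OpenAI models; it is chosen here
# purely for illustration.
enc = tiktoken.get_encoding("cl100k_base")

for word in ["tokenization", "the", "unbelievably"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode([tid]) for tid in token_ids]
    print(f"{word!r} -> {len(token_ids)} token(s): {pieces}")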

Continue reading “Why LLM Vocabulary Size Matters”

Giving OpenAI Predicted Output a Try

OpenAI recently released a new feature in their API called Predicted Outputs, designed to reduce the latency of model responses in scenarios where much of the response is already known in advance. A good use case for this is when making incremental changes to code or text. I am currently working on an LLM use case for a customer, building a JSON document generator for their low-code tool. The user interacts through a chat interface to incrementally build the JSON document. I decided to give Predicted Outputs a try to see how it performs with this use case.
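
For readers who have not seen the feature, a minimal sketch of a Predicted Outputs call is shown below; the JSON document and the instruction are placeholders standing in for my customer's actual use case:

from openai import OpenAI

client = OpenAI()

# The JSON document from the previous turn; most of it will remain unchanged,
# so we pass it as the prediction to reduce response latency.
existing_json = '{"title": "Invoice", "fields": [{"name": "amount", "type": "number"}]}'

response = client.chat.completions.create(
    model="gpt-4o",  # Predicted Outputs is only supported on certain models
    messages=[
        {"role": "system", "content": "You update JSON documents based on user instructions. Return only the JSON."},
        {"role": "user", "content": f"Add a 'currency' string field to this document:\n{existing_json}"},
    ],
    prediction={"type": "content", "content": existing_json},
)

print(response.choices[0].message.content)
print(response.usage)  # completion token details report how many prediction tokens were accepted or rejected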

Continue reading “Giving OpenAI Predicted Output a Try”

Adding Web Search to a LLM Chat Application

One common request from customers I have helped build custom ChatGPTs for is support for web search: users want to fetch relevant data from web searches directly within the chat interface. In this post, I will show you how to quickly add web search functionality to a chat application. We will use Gradio to build a quick chatbot.

Gradio is an open-source Python library that allows you to build interactive web interfaces for machine learning models with ease. It simplifies the process of creating user-friendly applications by providing pre-built components and a straightforward API, enabling developers to focus on the core functionality of their projects without worrying about the intricacies of web development.

Let’s get started.
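
To set expectations, here is a minimal sketch of the kind of thing we will build: a Gradio chatbot that feeds web search results into the model's prompt. The sketch assumes the duckduckgo_search package and the OpenAI chat API purely for illustration; the full post may use a different search provider:

import gradio as gr
from duckduckgo_search import DDGS  # pip install duckduckgo-search
from openai import OpenAI           # pip install openai

client = OpenAI()

def chat(message, history):
    # Fetch a handful of web results and pass them to the model as context.
    results = DDGS().text(message, max_results=5)
    context = "\n".join(f"- {r['title']}: {r['body']}" for r in results)

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer the user using the web search results below when relevant.\n" + context},
            {"role": "user", "content": message},
        ],
    )
    return response.choices[0].message.content

# ChatInterface wires the function into a chat UI with message history handling built in.
gr.ChatInterface(chat, title="Chat with web search").launch()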

Continue reading “Adding Web Search to a LLM Chat Application”

Gods, Interns, and Cogs: A useful framework to categorize AI use cases

I recently came across an insightful article by Drew Breunig that introduces a compelling framework for categorizing the use cases of Generative AI (Gen AI) and Large Language Models (LLMs). He classifies these applications into three distinct categories: Gods, Interns, and Cogs. Each bucket represents a different level of automation and complexity, and it’s fascinating to consider how these categories are shaping the AI landscape today.

Continue reading “Gods, Interns, and Cogs: A useful framework to categorize AI use cases”

LLM API Pricing Calculator

I’ve recently begun building small HTML/JS web apps. These are all client-side only, and most are initially generated using ChatGPT in under 30 minutes. One example is my LLM API pricing calculator (https://tools.o14.ai/llm-cost-calculator.html). While there are other LLM API pricing calculators available, I build and generate these tools myself for several reasons:

  • Known User Base: I have a need for the tool myself, so I already have the first user.
  • Rapid Prototyping: Experimenting with the initial idea is frictionless. Building the first version typically takes only 5-8 prompts.
  • Focus on Functionality: There’s no pressure to optimize the code at first. It simply needs to function.
  • Iterative Development: Using the initial version reveals additional features needed. For instance, the first iteration of the LLM API pricing calculator only displayed prices based on the number of API calls and input/output tokens. As I used it, I realized the need for filtering by provider and model, so I added those functionalities.
  • Privacy: I know no one is tracking me.
  • Learning by Doing: This process allows me to learn about both the strengths and limitations of LLMs for code generation.

The LLM API pricing calculator shows the pricing of different LLM APIs. You can look at the source code using your browser’s View Source.

Webpage: https://tools.o14.ai/llm-cost-calculator.html

The prompt that generated the first version of the app is shown below. I also provided an image showing how I wanted the UI to look.

We want to build a simple HTML Javascript based calculator as shown in the image. The table data will come from a remote JSON at location https://raw.githubusercontent.com/BerriAI/litellm/refs/heads/main/model_prices_and_context_window.json. The structure of JSON looks like as shown below. We will use tailwind css. Generate the code

{
  "gpt-4": {
    "max_tokens": 4096,
    "max_input_tokens": 8192,
    "max_output_tokens": 4096,
    "input_cost_per_token": 0.00003,
    "output_cost_per_token": 0.00006,
    "litellm_provider": "openai",
    "mode": "chat",
    "supports_function_calling": true
  },
  "gpt-4o": {
    "max_tokens": 4096,
    "max_input_tokens": 128000,
    "max_output_tokens": 4096,
    "input_cost_per_token": 0.000005,
    "output_cost_per_token": 0.000015,
    "litellm_provider": "openai",
    "mode": "chat",
    "supports_function_calling": true,
    "supports_parallel_function_calling": true,
    "supports_vision": true
  }
}

This API pricing calculator is based on pricing information maintained by the LiteLLM project. You can look at all the prices here: https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json. In my experience, the pricing information is not always up to date, so you should always confirm the latest price on the provider’s pricing page.
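
For reference, the calculation behind the tool is straightforward; a rough Python equivalent of the client-side logic (the actual tool is plain HTML/JS) could look like this:

import json
import urllib.request

PRICES_URL = (
    "https://raw.githubusercontent.com/BerriAI/litellm/refs/heads/main/"
    "model_prices_and_context_window.json"
)

def estimate_cost(model, calls, input_tokens, output_tokens):
    """Estimate the total cost of `calls` API calls, each using the given token counts."""
    with urllib.request.urlopen(PRICES_URL) as resp:
        prices = json.load(resp)
    entry = prices[model]
    per_call = (
        input_tokens * entry["input_cost_per_token"]
        + output_tokens * entry["output_cost_per_token"]
    )
    return calls * per_call

# Example: 1,000 calls to gpt-4o, each with roughly 2,000 input and 500 output tokens.
print(f"${estimate_cost('gpt-4o', 1000, 2000, 500):.2f}")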

Monkey patching autoevals to show token usage

I use the autoevals library to write evals for evaluating the output of LLMs. In case you have never written an eval before, let me help you understand it with a simple example. Let’s assume that you are building a quote generator where you ask an LLM to generate an inspirational Steve Jobs quote for software engineers.
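
To make that concrete, a minimal eval with autoevals could look like the sketch below; the Factuality scorer (one of the library's LLM-as-a-judge scorers) and the example strings are chosen purely for illustration:

from autoevals.llm import Factuality  # pip install autoevals

# Hypothetical example: the prompt, the model's output, and what we expected.
prompt = "Generate an inspirational Steve Jobs quote for software engineers."
output = "Stay hungry, stay foolish - and ship work you are proud of."
expected = "A short, inspirational quote in the spirit of Steve Jobs, aimed at software engineers."

evaluator = Factuality()
result = evaluator(output, expected, input=prompt)

print(result.score)     # a score between 0 and 1 assigned by the judge model
print(result.metadata)  # includes the judge's rationale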

Continue reading “Monkey patching autoevals to show token usage”

Prompt Engineering Lessons We Can Learn From Claude System Prompts

Anthropic published Claude’s system prompts on their documentation website this week. Users spend countless hours trying to get AI assistants to leak their system prompts, so Anthropic publishing them in the open suggests two things: 1) prompt leakage is less of an attack vector than most people think, and 2) any useful real-world GenAI application is much more than just the system prompt (they are compound AI systems with a user-friendly UX/interface/features, workflows, multiple search indexes, and integrations).

Compound AI systems, as defined by the Berkeley AI Research (BAIR) blog, are systems that tackle AI tasks by combining multiple interacting components. These components can include multiple calls to models, retrievers or external tools. Retrieval augmented generation (RAG) applications, for example, are compound AI systems, as they combine (at least) a model and a data retrieval system. Compound AI systems leverage the strengths of various AI models, tools and pipelines to enhance performance, versatility and re-usability compared to solely using individual models.

Anthropic has released system prompts for three models – Claude 3.5 Sonnet, Claude 3 Opus, and Claude 3 Haiku. We will look at the Claude 3.5 Sonnet system prompt (July 12th, 2024). The system prompt below is close to 1,200 input tokens long.

Continue reading “Prompt Engineering Lessons We Can Learn From Claude System Prompts”

Using ffmpeg, yt-dlp, and gpt-4o to Automate Extraction and Explanation of Python Code from YouTube Videos

Today I was watching a video on LLM evaluation: https://www.youtube.com/watch?v=SnbGD677_u0. It is a long video (2.5 hours) with multiple sections and multiple speakers. In one of the sections, a speaker showed code in Jupyter notebooks. Because of the small font and the pace at which the speaker was talking, the section was hard to follow.

I wondered if I could use yt-dlp along with an LLM to solve this problem. This is what I want to do (a rough sketch of the pipeline follows the list):

  1. Download the specific section of the video
  2. Take screenshots of different frames in that section
  3. Send the screenshots to an LLM to extract the code
  4. Ask the LLM to explain the code in a step-by-step manner
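
A hedged sketch of that pipeline in Python is shown below; the timestamps, filenames, and one-frame-every-30-seconds sampling are assumptions, and yt-dlp and ffmpeg are simply driven through subprocess:

import base64
import glob
import subprocess
from openai import OpenAI

VIDEO_URL = "https://www.youtube.com/watch?v=SnbGD677_u0"

# 1. Download only the section of the video I care about (timestamps are placeholders).
subprocess.run([
    "yt-dlp", "--download-sections", "*00:45:00-00:55:00",
    "-o", "section.mp4", VIDEO_URL,
], check=True)

# 2. Grab one frame every 30 seconds from the downloaded clip.
subprocess.run([
    "ffmpeg", "-i", "section.mp4", "-vf", "fps=1/30", "frame_%03d.png",
], check=True)

# 3 and 4. Send the frames to gpt-4o and ask it to extract and explain the code.
client = OpenAI()
images = []
for path in sorted(glob.glob("frame_*.png")):
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode()
    images.append({"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded}"}})

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [{"type": "text", "text": "Extract the Python code visible in these frames and explain it step by step."}] + images,
    }],
)
print(response.choices[0].message.content)
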
Continue reading “Using ffmpeg, yt-dlp, and gpt-4o to Automate Extraction and Explanation of Python Code from YouTube Videos”

How I use LLMs: Building a Tab Counter Chrome Extension

Last night, I found myself overwhelmed by open tabs in Chrome. I wondered how many I had open but couldn’t find a built-in tab counter. Third-party extensions for this likely exist, but I am not comfortable installing them.

I have built Chrome extensions before (I know, it’s possible in a few hours!), but the process usually frustrates me. Figuring out permissions, content scripts vs. service workers, and icon creation (in various sizes) consumes time. Navigating the Chrome extension documentation can be equally daunting.

These “nice-to-have” projects often fall by the wayside due to the time investment. After all, I can live without a tab counter.

LLMs (Large Language Models) help me build such projects. Despite their limitations, they significantly boost my productivity on these tasks. Building a Chrome extension isn’t about resume padding; it’s about scratching an itch. LLMs excel at creating these workflow-enhancing utilities. I use them to write single-purpose bash scripts, Python scripts, and Chrome extensions. You can find some of my LLM wrapper tools on GitHub here.

Continue reading “How I use LLMs: Building a Tab Counter Chrome Extension”