Generating architecture.md with code2prompt and OpenAI gpt-4o-mini model

New contributors often struggle to grasp a project’s intricacies without well-defined architecture documentation. This post shows how we can leverage Large Language Models (LLMs) to automate the generation of architecture documentation (architecture.md) directly from a project’s codebase. I first learnt about architecture.md from a post published in 2021. Many popular open source projects, like Caddy, include an architecture.md in their source code.

We call our script as shown below.

./architecturemd-generator.sh https://github.com/frdel/agent-zero.git

and it generates architecture.md, as shown in the screenshot below. You can view the complete generated architecture.md file in the GitHub repository.
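The script itself is a thin pipeline around code2prompt and llm. Here is a minimal sketch of the approach in Python; the clone step, code2prompt flags, and prompt wording are illustrative, not the exact ones used in architecturemd-generator.sh:

```python
def build_pipeline(repo_url: str, model: str = "gpt-4o-mini"):
    """Return the clone directory and the shell steps of the pipeline.

    The flags and prompt wording here are illustrative, not the exact
    ones used in architecturemd-generator.sh.
    """
    repo_dir = repo_url.rstrip("/").removesuffix(".git").split("/")[-1]
    steps = [
        # 1. fetch the codebase
        f"git clone {repo_url} {repo_dir}",
        # 2. flatten it into a single LLM-ready context (check
        #    `code2prompt --help` for the exact output flag)
        f"code2prompt {repo_dir} > {repo_dir}.txt",
        # 3. ask the model to write architecture.md from that context
        f"llm -m {model} -s 'Write an architecture.md for this codebase' "
        f"< {repo_dir}.txt > architecture.md",
    ]
    return repo_dir, steps
```

The key design point is that code2prompt does the heavy lifting of turning a whole repository into one prompt-sized context, so the LLM call at the end is a single summarization-style request.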

Continue reading “Generating architecture.md with code2prompt and OpenAI gpt-4o-mini model”

Building a YouTube Video Summarizer with llm and yt-dlp

In this blog post, we’ll create a handy utility that summarizes YouTube videos using the power of large language models (LLMs) and the versatility of Python’s yt-dlp tool. It leverages the summarizing capabilities of llm to extract key points and insights from YouTube subtitles, making it easier to grasp the video’s content without having to watch the entire thing.

Setting the Stage

Before we dive in, let’s ensure you have the necessary tools:

  1. llm: This command-line interface allows us to interact with large language models. Follow the installation instructions on the llm project’s website https://llm.datasette.io/en/stable/index.html.
  2. yt-dlp: This versatile tool helps download various formats from YouTube, including subtitles. Install it using pip install yt-dlp.
  3. Set the OPENAI_API_KEY environment variable. This utility defaults to using the OpenAI API with the gpt-4o-mini model.
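With those in place, the core flow is: fetch the auto-generated subtitles with yt-dlp, then pipe them to llm for summarization. A sketch of the two commands; the flags, subtitle filename, and prompt wording are illustrative defaults, not necessarily the real script’s:

```python
def summarize_commands(video_url: str, model: str = "gpt-4o-mini"):
    """Return the two shell commands of the pipeline.

    Flags, subtitle filename, and prompt wording are illustrative;
    the actual utility may differ in details.
    """
    # download only the English auto-generated subtitles, no video
    fetch_subs = (
        f"yt-dlp --skip-download --write-auto-sub --sub-lang en "
        f"--sub-format vtt -o subs {video_url}"
    )
    # feed the subtitle file to llm with a summarization system prompt
    summarize = (
        f'llm -m {model} -s "Summarize the key points of this video '
        f'from its subtitles" < subs.en.vtt'
    )
    return [fetch_subs, summarize]
```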

GitHub Repo

You can get the complete source code here https://github.com/shekhargulati/llm-tools.

Continue reading “Building a YouTube Video Summarizer with llm and yt-dlp”

Building Prompt Injection Detector with Text Embeddings and LogisticRegression

In this post, we will discuss how to build a Prompt Injection detector using a simple classification task with Scikit-learn’s Logistic Regression. Logistic Regression is a statistical method for binary classification, i.e., predicting outcomes that have only two possible values.

We will use the SPML Chatbot Prompt Injection Dataset as input data.

Install the following libraries:

pip install datasets
pip install sentence-transformers
pip install scikit-learn

We will start by loading the dataset:

from datasets import load_dataset
dataset = load_dataset("reshabhs/SPML_Chatbot_Prompt_Injection")

Let’s look at the dataset:

dataset
DatasetDict({
    train: Dataset({
        features: ['System Prompt', 'User Prompt', 'Prompt injection', 'Degree', 'Source'],
        num_rows: 16012
    })
})

This displays the dataset structure. There are 16,012 records in this dataset, each with five columns:

  • System Prompt
  • User Prompt
  • Prompt injection
  • Degree
  • Source
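The rest of the pipeline embeds each user prompt with sentence-transformers and fits a Logistic Regression classifier on the embeddings. A condensed sketch, where the embedding model name and the 80/20 split are reasonable defaults rather than the post’s exact choices (the `encode` callable stands in for `SentenceTransformer("all-MiniLM-L6-v2").encode`):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def train_detector(texts, labels, encode):
    """Fit a prompt-injection classifier on text embeddings.

    `encode` maps a list of strings to a 2-D array of embeddings, e.g.
    SentenceTransformer("all-MiniLM-L6-v2").encode. The model choice
    and 80/20 split are illustrative defaults.
    """
    X = encode(texts)
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.2, random_state=42, stratify=labels
    )
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    # report held-out accuracy alongside the fitted classifier
    return clf, clf.score(X_test, y_test)
```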
Continue reading “Building Prompt Injection Detector with Text Embeddings and LogisticRegression”

Leveraging BERTopic to Understand AI Assistant Usage Patterns

I have been building and operating a ChatGPT-like enterprise AI assistant for the past year. We log all user queries in a database for future analysis and for building personalized features. Usage has grown over time, and it is becoming difficult for our small team of four to rely on manual quality-analysis methods like eyeballing and vibe checks to understand system accuracy and usage patterns.

In this quick post I will cover how we can use BERTopic and the OpenAI gpt-4o-mini model to cluster user queries into labelled groups. We will run this analysis on the Chatbot Arena dataset.

This dataset contains 33K cleaned conversations with pairwise human preferences. It is collected from 13K unique IP addresses on the Chatbot Arena from April to June 2023. Each sample includes a question ID, two model names, their full conversation text in OpenAI API JSON format, the user vote, the anonymized user ID, the detected language tag, the OpenAI moderation API tag, the additional toxic tag, and the timestamp.

BERTopic is an open-source project that offers a novel approach to topic modelling. Topic modelling is an unsupervised, exploratory approach to making sense of a collection of documents.

By leveraging the power of BERT, a state-of-the-art language model, and c-TF-IDF (a class-based variation of the traditional Term Frequency-Inverse Document Frequency algorithm, designed to work with clusters of documents), BERTopic helps uncover hidden thematic structures within your text data. The approach assumes that documents grouped by semantic similarity can effectively represent a topic: each cluster reflects a major theme, and the combined clusters paint a broader picture.
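To make c-TF-IDF concrete: each cluster’s documents are treated as one big document, and a term’s weight in a cluster is its frequency there times log(1 + A/f_t), where A is the average number of words per cluster and f_t is the term’s frequency across all clusters. A toy reimplementation for illustration (not BERTopic’s actual code):

```python
import math
from collections import Counter

def c_tf_idf(clusters):
    """Toy class-based TF-IDF.

    clusters: list of token lists, one per cluster (all documents in a
    cluster concatenated). Returns {cluster_index: {term: weight}}.
    """
    per_cluster = [Counter(tokens) for tokens in clusters]
    corpus = Counter()
    for counts in per_cluster:
        corpus.update(counts)
    avg_words = sum(len(tokens) for tokens in clusters) / len(clusters)  # A
    return {
        i: {
            term: tf * math.log(1 + avg_words / corpus[term])
            for term, tf in counts.items()
        }
        for i, counts in enumerate(per_cluster)
    }
```

Terms that dominate one cluster but are rare elsewhere get the highest weights, which is what makes the per-cluster top terms read like topic labels.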

Continue reading “Leveraging BERTopic to Understand AI Assistant Usage Patterns”

From Screenshots to Markdown Tables with LLMs

One of the tasks I frequently use ChatGPT-like tools for is extracting markdown text from images. I enjoy watching conference videos on YouTube. Often, I find slides during these videos that I want to keep for future reference. To achieve this, I take screenshots and add them to my notebook. However, if I forget to add any textual comments with the screenshots, searching for them later becomes difficult. Additionally, there are times when I need to extract text in markdown format from the screenshots for future use.

Let’s look at an example screenshot that I took yesterday from a talk by OpenAI engineer on Fine Tuning.
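Under the hood this is a single vision-model call: the screenshot goes in as a base64-encoded image alongside a text instruction. A minimal sketch of the request payload in the OpenAI chat-completions format (the prompt wording is mine, and in practice you would read the screenshot bytes from disk):

```python
import base64

def build_vision_request(image_bytes: bytes, model: str = "gpt-4o-mini"):
    """Build a chat-completions payload that asks the model to extract
    an image's content as a markdown table."""
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Extract the text in this image as a markdown table."},
                # images are passed as data URLs in the image_url part
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }
```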

Continue reading “From Screenshots to Markdown Tables with LLMs”

Query Rewriting in RAG Applications

Creating an AI assistant that generates helpful answers from a knowledge base is a complex problem. A significant hurdle is the frequent mismatch between how users ask questions and how information is structured within the data. Most people struggle to ask good questions, which often results in irrelevant or incomplete answers and frustrated users.

As builders of these systems, we should not expect users to write well-crafted queries. In our application, we have implemented query rewriting to rephrase user queries so they better align with the underlying data. This has dramatically improved the accuracy and helpfulness of our AI assistant’s responses.

In this post, I will share details of how we implemented query rewriting in our application. We will end by looking at how popular open-source systems do query rewriting.
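The simplest form of query rewriting is an extra LLM call before retrieval: the raw query (plus any conversation history) goes to a cheap model with a rewriting instruction, and the rewritten query is what actually hits the retriever. A hedged sketch of assembling that prompt; the instruction text is illustrative, not our exact production prompt:

```python
REWRITE_INSTRUCTION = (
    "Rewrite the user's question so it is self-contained and uses "
    "terminology likely to appear in the knowledge base. "
    "Return only the rewritten question."
)

def build_rewrite_prompt(query: str, history=None) -> str:
    """Assemble the prompt for the query-rewriting LLM call.

    `history` is an optional list of earlier conversation turns, used
    to resolve pronouns and follow-up questions.
    """
    parts = [REWRITE_INSTRUCTION]
    if history:
        parts.append("Conversation so far:\n" + "\n".join(history))
    parts.append(f"User question: {query}")
    return "\n\n".join(parts)
```

The rewritten query, not the raw one, is then embedded and sent to the vector store; the original question is still what the answer-generation prompt shows the final model.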

You can listen to this post in podcast format here – https://notebooklm.google.com/notebook/ed6e648e-c95c-4ad8-88a2-767be02c7c4d/audio

Continue reading “Query Rewriting in RAG Applications”

A simple optimization that reduced output tokens by 30% in our LLM-based RAG solution

I’ve been running a chat assistant application built on OpenAI for the past year. My biggest learning has come from analyzing our AI assistant’s responses and finding ways to optimize them (for both cost and quality). Like all RAG applications, we add source URLs to all chunks and instruct the LLM to include citations referencing the source link. Here’s a snippet of our answer generation prompt:

For each document indicate which sources most support it via valid citation markers at the end of sentence in the markdown format. Add a link to the source using markdown format. Also, include page number with the source.

Our analysis revealed that over 60% of our answers contain more than five source links, with listing questions exceeding ten links. These links inflate both input and output tokens.
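One kind of fix this points toward (sketched here as an assumption about the mechanism, not necessarily our exact change) is to stop making the model repeat full URLs: map each unique source to a short numeric marker in the context, let the model emit `[1]`-style citations, and expand the markers back to links after generation.

```python
def index_sources(chunks):
    """Replace repeated source URLs with short numeric markers.

    chunks: list of (text, url) pairs. Returns the prompt context and
    a marker->url map used to expand citations after generation.
    """
    url_to_id, lines = {}, []
    for text, url in chunks:
        if url not in url_to_id:
            url_to_id[url] = len(url_to_id) + 1
        lines.append(f"[{url_to_id[url]}] {text}")
    context = "\n".join(lines)
    marker_to_url = {f"[{i}]": url for url, i in url_to_id.items()}
    return context, marker_to_url
```

Since a marker like `[1]` is one or two tokens versus dozens for a long URL, answers with many citations shrink substantially on both the input and output side.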

Continue reading “A simple optimization that reduced output tokens by 30% in our LLM-based RAG solution”

RouteLLM Paper

Paper Link : https://arxiv.org/pdf/2406.18665v2

Paper Title: RouteLLM: Learning to Route LLMs with Preference Data

With the growing capabilities of large language models (LLMs), efficiently utilizing them becomes crucial. LLM routing emerges as a promising solution. It directs user queries to the most suitable LLM based on factors like complexity and domain. This approach aims to optimize response quality while minimizing costs. However, optimal routing presents a challenge: the router model needs to understand the query’s intent, complexity, and domain, along with the capabilities of candidate LLMs. Additionally, it should be economical, fast, and adaptable to new, improved models.
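The core idea reduces to a threshold router: a learned scorer estimates how likely the weak model’s answer would lose to the strong model’s, and only queries above the threshold go to the expensive model. A toy sketch, where `score` stands in for RouteLLM’s routers trained on preference data (the strong/weak model names follow the paper’s GPT-4 / Mixtral-8x7B setup):

```python
def route_queries(queries, score, threshold=0.5,
                  strong="gpt-4", weak="mixtral-8x7b"):
    """Threshold router.

    `score(q)` estimates the probability that the weak model's answer
    would lose to the strong model's; queries above the threshold are
    routed to the strong model, the rest to the weak one.
    """
    routed = [(q, strong if score(q) >= threshold else weak) for q in queries]
    strong_share = sum(model == strong for _, model in routed) / len(routed)
    return routed, strong_share
```

Tuning the threshold trades cost against quality: lowering it sends more traffic to the strong model, raising it saves money at the risk of worse answers on hard queries.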

Continue reading “RouteLLM Paper”

Practical Takeaways from “APIGen: Automated PIpeline for Generating Verifiable and Diverse Function-Calling Datasets” Paper

A recent paper by the Salesforce AI research team describes a method for generating function-calling datasets for Large Language Models (LLMs). Function calling enables LLMs to interact with external systems, like remote APIs, databases, or in-process code. This equips LLMs with tools to perform specific actions, such as retrieving weather information, booking reservations, or fetching stock data from APIs.

If you’re unfamiliar with function calling, refer to the OpenAI docs to learn more.
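For orientation: function calling works by giving the model JSON schemas for the tools it may invoke, and the model replies with a function name and JSON arguments instead of prose. A minimal tool definition in the OpenAI format, using the weather example above (the function name and fields are illustrative):

```python
def weather_tool_schema():
    """An OpenAI-style tool definition for a weather lookup function."""
    return {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                # JSON Schema describing the function's arguments
                "type": "object",
                "properties": {
                    "city": {"type": "string",
                             "description": "City name, e.g. 'Paris'"},
                    "unit": {"type": "string",
                             "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["city"],
            },
        },
    }
```

APIGen’s contribution is generating and verifying large numbers of (query, tool schema, expected call) triples like this automatically, so models can be trained to produce the right call for a given query.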

This post explores practical takeaways for developers building LLM applications.

Continue reading “Practical Takeaways from “APIGen: Automated PIpeline for Generating Verifiable and Diverse Function-Calling Datasets” Paper”

Building a web page summarizer with llm utility

One of the useful LLM tools I’ve recently started using is the llm Python CLI by Simon Willison. It simplifies playing with different LLM models from the command line and allows you to build quick scripts by piping together multiple command-line utilities.

On macOS, you can install llm using brew:

brew install llm

In my daily work, I frequently use LLMs for summarization. Summarization can take many forms, and there’s no single best way to summarize a given text. To address this, I built a CLI using the llm tool that extracts text from a web URL and then summarizes it for me.

The core of the script is the following one-line command:

curl -s https://r.jina.ai/$url | llm -m "$model" -s "$prompt"
Continue reading “Building a web page summarizer with llm utility”