Chat-Driven Development: A Better Way to Use LLMs for Coding

Over the last couple of years, I have subscribed to GitHub Copilot multiple times and canceled each time. It never felt natural to me: it was annoying and got too much in my way. Chat-driven development with ChatGPT or Claude feels more natural. I feel more in control because I can be explicit about what I want, work on smaller problems, and pass in only the relevant context. This helps LLMs generate better code for me.

Today I was reading a blog post in which the author shared similar experiences. They listed the following reasons:

  1. Clean slate advantage: The author mentions that IDE workspaces are often messy, repositories too large, and full of distractions. LLMs can get confused with too much complexity and ambiguity. Using chat through a web browser provides a blank slate for well-contained requests.
  2. Control over context: Chat allows carefully crafting specific, exam-style questions with just the necessary background material, leading to better results than overwhelming the LLM with the entire IDE context.
  3. Energy management: The author uses chat-driven programming when they know what needs to be written but lack the energy to start from scratch (especially after 11 AM or during context switches between languages/frameworks). Getting a first draft from an LLM with dependencies and structure makes it easier to fix mistakes than starting from zero.
  4. Protection of development environment: The author explicitly states “I do not want an LLM spewing its first draft all over my current branch.” They prefer keeping LLM experimentation separate from their main development environment.
  5. Feedback cycle: The author finds that using chat makes it easier to iterate on code by simply pasting compiler errors or test failures back to the LLM for corrections, without cluttering the IDE workspace.

Along with the above, there are a few more reasons why I don’t like LLM IDE integrations:

  1. For me, LLMs are a thinking tool. I use them to think through possible designs and table structures, and to brainstorm. IDE-based LLM integrations do not encourage this kind of thinking; their incentive is to generate code only.
  2. I want to reach for help when I need it rather than having it tell me it can help. This inversion of dependency just doesn’t work. ChatGPT requires explicit intent/questions, whereas GitHub Copilot relies on implicit intent.
  3. I prefer editors to be lightweight without any baggage. Fast, basic code completion is what I need from my IDEs.

Reasons inference scaling models like OpenAI o1 might not benefit agents

I was surprised when I read that the OpenAI o1 model does not seem to benefit agents. If a model can think and reason better, then it should help agents make better decisions and be more reliable. Yet on the CORE-Bench benchmark, o1-mini scored 24%, compared to 38% for Claude 3.5 Sonnet. The article cites two main reasons:

  • Inference scaling models like o1 require different prompting styles than regular models, and current agentic systems are optimized for prompting regular models.
  • Reasoning or inference scaling models have so far not been trained with reinforcement learning in a setting where they receive feedback from the environment, be it code execution, shell interaction, or web search. In other words, their tool-use ability is no better than that of the underlying model before it learned to reason.

Another reason I have read is that o1 can behave strangely when dropped into custom agentic architectures, since it was trained to follow a specific reasoning process; indeed, OpenAI actively discourages prompts that ask the model to reason in any particular way.

Using a Tor proxy to bypass IP restrictions

Yesterday I was working on a personal project where I needed to download the transcript of a YouTube video. I use the unofficial youtube-transcript-api Python library to fetch the transcript for any YouTube video. It worked fine locally, but as soon as I deployed my app on a cloud VM it started throwing errors.

The code to list transcripts for a video is shown below.

from youtube_transcript_api import YouTubeTranscriptApi

video_id = "eIho2S0ZahI"
transcripts = YouTubeTranscriptApi.list_transcripts(video_id)

The code throws the following error on the cloud VM.

Could not retrieve a transcript for the video https://www.youtube.com/watch?v=w8rYQ40C9xo! This is most likely caused by:

Subtitles are disabled for this video

If you are sure that the described cause is not responsible for this error and that a transcript should be retrievable, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues. Please add which version of youtube_transcript_api you are using and provide the information needed to replicate the error. Also make sure that there are no open issues which already describe your problem!

As described in the GitHub issue https://github.com/jdepoix/youtube-transcript-api/issues/303, the only way to overcome this issue is to use a proxy. Most residential proxies cost somewhere between $5-10 per GB of data transfer; since they are priced per GB, the cost varies with your usage. One of the commenters suggested using a Tor proxy. Tor provides anonymity by routing your internet traffic through a series of volunteer-operated servers, which can help you avoid being blocked by services like YouTube.

There is an open source project, https://github.com/dperson/torproxy, that lets you run a Tor proxy in a Docker container.

You can run it as follows:

docker run -p 8118:8118 -p 9050:9050 -d dperson/torproxy

This runs a Tor SOCKS proxy on port 9050 (and an HTTP proxy on port 8118).
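
Before changing the application, it is worth a quick sanity check that traffic actually goes through Tor. A minimal check, assuming requests and PySocks are installed (pip install "requests[socks]"):

import requests

proxies = {
    "http": "socks5://127.0.0.1:9050",
    "https": "socks5://127.0.0.1:9050",
}

# Your VM's public IP vs. the Tor exit node's IP seen through the proxy
print(requests.get("https://api.ipify.org").text)
print(requests.get("https://api.ipify.org", proxies=proxies).text)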

You can then change your code to use the proxy.

from youtube_transcript_api import YouTubeTranscriptApi

# Route all HTTP(S) requests through the local Tor SOCKS proxy
proxies = {
    'http': "socks5://127.0.0.1:9050",
    'https': "socks5://127.0.0.1:9050",
}

video_id = "eIho2S0ZahI"
transcripts = YouTubeTranscriptApi.list_transcripts(video_id, proxies=proxies)

Paper: Don’t Expect Juniors to Teach Senior Professionals to Use Generative AI

Yesterday I was reading a Harvard Business School working paper titled Don’t Expect Juniors to Teach Senior Professionals to Use Generative AI: Emerging Technology Risks and Novice AI Risk Mitigation Tactics. The study was conducted with Boston Consulting Group, a global management consulting firm. The researchers interviewed 78 junior consultants in July-August 2023 who had recently participated in a field experiment that gave them access to generative AI (GPT-4) for a business problem-solving task.

The paper makes the point that with earlier technologies it was junior professionals who helped senior professionals upskill. It cites multiple reasons why junior professionals are better able to learn and use new technology than their senior counterparts:

  • First, junior professionals are often closest to the work itself, because they are the ones engaging in concrete and less complex tasks.
  • Second, junior professionals may be more able to engage in real-time experimentation with new technologies, because they do not risk losing their mandate to lead if those around them, including clients and those more junior to them, recognize that they lack the practical expertise to support their hierarchical position.
  • Third, junior professionals may be more willing to learn new methods that conflict with existing identities, practices, and frames.

The importance of starting with Error Analysis in LLM Applications

I was watching a video (link below) posted by Hamel on his YouTube channel. In the video, Hamel recommends that before we write any automated evals for our LLM application, we should spend a good amount of time looking at the data. He said, “Keep looking at the data until you are not learning anything new from it”.

The time and effort you spend on manual error analysis will help you identify the areas you should focus on. You will learn how users use your application, the kinds of queries they fire (short, long, keyword-only, etc.), whether you are retrieving the right context, whether responses follow your instructions, and so on.

You start by reviewing individual user interactions, categorizing errors in a straightforward manner (e.g., using a spreadsheet, or a low-code UI), and prioritizing fixes based on real user data. By focusing on recurring issues observed across interactions, you can address the most significant pain points before creating formal evaluation metrics.

The process is iterative: you keep mining user data for insights, use synthetic data to simulate scenarios you have not yet seen, and make sure every test is motivated by a real-world error rather than an arbitrary assumption. This keeps your evaluations meaningful and focused on the failures that actually matter.
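
As a concrete sketch of the “spreadsheet” step, once you have hand-labelled a batch of interactions you can tally the error categories to decide what to fix first. The file and column names below are hypothetical, just to illustrate the idea:

import csv
from collections import Counter

counts = Counter()

# Assumed columns: query, response, error_category (blank if the interaction was fine)
with open("reviewed_interactions.csv") as f:
    for row in csv.DictReader(f):
        if row["error_category"].strip():
            counts[row["error_category"]] += 1

# Most frequent error categories first; these are your priorities
for category, n in counts.most_common():
    print(f"{n:4d}  {category}")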

You can watch the video here

Why LLM Vocabulary Size Matters

Large Language Models (LLMs) are at the forefront of AI advancements, capable of understanding and generating human-like text. Central to their operation is the vocabulary: a predefined set of words or tokens that the model uses to interpret and generate text. But why does vocabulary size matter? This post looks at what an LLM vocabulary is, how vocabulary size affects model performance, and how different tokenization approaches influence the effectiveness of these models.

What is LLM Vocabulary?

An LLM’s vocabulary is the set of tokens, ranging from single characters to whole words or phrases, that the model uses to represent input and output text. Tokenization, the process of converting text into these tokens, is a critical step that influences how efficiently the model can process and understand data.

For example, the word “tokenization” might be split into smaller subwords like “token” and “ization,” or it could be represented as a single token, depending on the model’s vocabulary size and tokenization strategy. This has profound implications for the model’s performance, resource requirements, and capabilities.
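
You can see this in practice with a tokenizer library. Here is a quick check using OpenAI’s tiktoken with the cl100k_base encoding, which is just one vocabulary among many; a model trained with a different vocabulary may split the same word differently:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("tokenization")
print(ids)                              # the token ids
print([enc.decode([i]) for i in ids])   # the corresponding text pieces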


Giving OpenAI Predicted Output a Try

OpenAI recently released a new feature in their API called Predicted Outputs, designed to reduce the latency of model responses in scenarios where much of the response is already known in advance. A good use case for this is when making incremental changes to code or text. I am currently working on an LLM use case for a customer, building a JSON document generator for their low-code tool. The user interacts through a chat interface to incrementally build the JSON document. I decided to give Predicted Outputs a try to see how it performs with this use case.
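
Based on OpenAI’s documentation, using the feature amounts to passing the already-known content as a prediction parameter alongside a normal chat completion request. The sketch below uses a made-up JSON document as a stand-in for the customer’s actual documents:

import json
from openai import OpenAI

client = OpenAI()

# Hypothetical current state of the JSON document being edited
current_doc = json.dumps({"title": "Report", "sections": []})

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Add an empty 'summary' section to this JSON document:\n" + current_doc},
    ],
    # Most of the document is unchanged, so we pass it as the prediction;
    # the model only has to generate the parts that differ.
    prediction={"type": "content", "content": current_doc},
)

print(response.choices[0].message.content)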


Adding Web Search to a LLM Chat Application

One common request from most customers I have helped build custom ChatGPTs for is the ability to support web search. Users want the capability to fetch relevant data from web searches directly within the chat interface. In this post, I will show you how you can quickly add web search functionality to a chat application. We will use Gradio to build a quick chatbot.

Gradio is an open-source Python library that allows you to build interactive web interfaces for machine learning models with ease. It simplifies the process of creating user-friendly applications by providing pre-built components and a straightforward API, enabling developers to focus on the core functionality of their projects without worrying about the intricacies of web development.

Let’s get started.
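
As a rough sketch of the starting point, the snippet below wires a Gradio ChatInterface to a dummy search function. The search function is a placeholder of my own; in the full example it would call a real search API and feed the results to an LLM:

import gradio as gr

def search_web(query: str) -> str:
    # Placeholder: replace with a call to a real search API (Bing, Tavily, SerpAPI, ...)
    return f"(pretend search results for: {query})"

def respond(message, history):
    # A real bot would pass the search results to an LLM along with the chat history
    return search_web(message)

gr.ChatInterface(respond).launch()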


Gods, Interns, and Cogs: A useful framework to categorize AI use cases

I recently came across an insightful article by Drew Breunig that introduces a compelling framework for categorizing the use cases of Generative AI (Gen AI) and Large Language Models (LLMs). He classifies these applications into three distinct categories: Gods, Interns, and Cogs. Each bucket represents a different level of automation and complexity, and it’s fascinating to consider how these categories are shaping the AI landscape today.


LLM API Pricing Calculator

I’ve recently begun building small HTML/JS web apps. These are all client-side only, and most are initially generated using ChatGPT in under 30 minutes. One example is my LLM API price calculator (https://tools.o14.ai/llm-cost-calculator.html). While there are other LLM API pricing calculators available, I build these tools myself for several reasons:

  • Known User Base: I have a need for the tool myself, so I already have the first user.
  • Rapid Prototyping: Experimenting with the initial idea is frictionless. Building the first version typically takes only 5-8 prompts.
  • Focus on Functionality: There’s no pressure to optimize the code at first. It simply needs to function.
  • Iterative Development: Using the initial version reveals additional features needed. For instance, the first iteration of the LLM API pricing calculator only displayed prices based on the number of API calls and input/output tokens. As I used it, I realized the need for filtering by provider and model, so I added those functionalities.
  • Privacy: I know no one is tracking me.
  • Learning by Doing: This process allows me to learn about both the strengths and limitations of LLMs for code generation.

The LLM API pricing calculator shows the pricing of different LLM APIs (see the link below). You can look at the source code using your browser’s View Source.

Webpage: https://tools.o14.ai/llm-cost-calculator.html

The prompt that generated the first version of the app is shown below. I also provided an image showing how I wanted the UI to look.

We want to build a simple HTML Javascript based calculator as shown in the image. The table data will come from a remote JSON at location https://raw.githubusercontent.com/BerriAI/litellm/refs/heads/main/model_prices_and_context_window.json. The structure of JSON looks like as shown below. We will use tailwind css. Generate the code

{
  "gpt-4": {
    "max_tokens": 4096,
    "max_input_tokens": 8192,
    "max_output_tokens": 4096,
    "input_cost_per_token": 0.00003,
    "output_cost_per_token": 0.00006,
    "litellm_provider": "openai",
    "mode": "chat",
    "supports_function_calling": true
  },
  "gpt-4o": {
    "max_tokens": 4096,
    "max_input_tokens": 128000,
    "max_output_tokens": 4096,
    "input_cost_per_token": 0.000005,
    "output_cost_per_token": 0.000015,
    "litellm_provider": "openai",
    "mode": "chat",
    "supports_function_calling": true,
    "supports_parallel_function_calling": true,
    "supports_vision": true
  },
}

This API pricing calculator is based on pricing information maintained by the LiteLLM project. You can look at all the prices here: https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json. In my experience the pricing information is not always up to date, so you should confirm the latest prices on the provider’s pricing page.
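
For reference, the core arithmetic behind the calculator is tiny. Here is a rough Python equivalent that reads the same LiteLLM price file (the actual tool is plain HTML/JS, and the token counts below are made-up example numbers):

import requests

PRICES_URL = ("https://raw.githubusercontent.com/BerriAI/litellm/"
              "refs/heads/main/model_prices_and_context_window.json")
prices = requests.get(PRICES_URL).json()

model = "gpt-4o"
calls, input_tokens, output_tokens = 1000, 500, 200   # tokens per API call

p = prices[model]
cost = calls * (input_tokens * p["input_cost_per_token"]
                + output_tokens * p["output_cost_per_token"])
print(f"{model}: ${cost:.2f} for {calls} calls")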