Can OpenAI o1 model analyze GitLab Postgres Schema?

In July 2022, I wrote a blog on GitLab’s Postgres schema design. I analyzed the schema and documented some of its interesting patterns and lesser-known best practices. If memory serves, I spent close to 20 hours spread over a couple of weeks writing the post. The blog post also managed to reach the front page of Hacker News.

I use large language models (LLMs) daily for my work. I primarily use the gpt-4o, gpt-4o-mini, and Claude 3.5 Sonnet models exposed via the ChatGPT or Claude AI assistants. I tried the o1 model once when it launched, but I couldn’t find the right use case for it. I tried it on a code translation task, and it didn’t work well, so I decided not to invest much effort in it.

OpenAI’s o1 series models are large language models trained with reinforcement learning to perform complex reasoning. These o1 models think before they answer, producing a long internal chain of thought before responding to the user.
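For context, here is a minimal sketch (mine, not from the post) of what calling an o1-series model through the OpenAI Python SDK looks like; the prompt is a placeholder, and the hidden chain of thought is never returned, only counted in the usage stats.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# o1-series models reason internally before answering; at launch they did
# not support a system message or a temperature setting.
response = client.chat.completions.create(
    model="o1-mini",
    messages=[
        {
            "role": "user",
            "content": "Review this Postgres schema and list notable design patterns:\n<schema DDL here>",
        }
    ],
)

print(response.choices[0].message.content)
# The reasoning itself stays hidden, but its token count is reported:
print(response.usage.completion_tokens_details.reasoning_tokens)
```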

Continue reading “Can OpenAI o1 model analyze GitLab Postgres Schema?”

The One GitHub Copilot Feature I Use

A couple of days back, I posted that I prefer chat-driven development with ChatGPT or Claude over IDE-integrated LLM tools like GitHub Copilot. An old friend reached out and asked if there is any part of my development workflow where LLM IDE integration makes me more productive. It turns out there is one place where I still like to use GitHub Copilot with VS Code: writing Git commit messages after I have made changes. A good, clean Git history is important to me; it helps me understand why I made a change. I’m a lazy person, so I often end up writing poor commit messages.

Continue reading “The One GitHub Copilot Feature I Use”

Giving Microsoft Phi-4 LLM model a try

Microsoft has officially released the MIT-licensed Phi-4 model. It is available on Hugging Face: https://huggingface.co/microsoft/phi-4.

Phi-4 is a 14B-parameter, state-of-the-art open model built upon a blend of synthetic datasets, data from filtered public-domain websites, and acquired academic books and Q&A datasets.

I wanted to give it a try, so I used Ollama on runpod.io. You can follow the instructions here: https://docs.runpod.io/tutorials/pods/run-ollama

I used the 4-bit quantized model on Ollama; you can also try the 8-bit and fp16 versions. As I mentioned in my last blog, 4-bit quantization strikes a good balance between performance and efficiency. I also tried the 8-bit quantized model, but both worked the same for me.
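If you prefer a script over the CLI, here is a rough sketch using the Ollama Python client. I am assuming the default phi4 tag pulls the 4-bit build and that the 8-bit and fp16 variants are published under separate tags; check the Ollama model page for the exact names.

```python
import ollama  # pip install ollama; assumes an Ollama server is already running

# The default "phi4" tag is the 4-bit quantized build; 8-bit and fp16
# builds are published under separate tags on the Ollama model page.
response = ollama.chat(
    model="phi4",
    messages=[{"role": "user", "content": "Summarize the CAP theorem in two sentences."}],
)

print(response["message"]["content"])
```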

Continue reading “Giving Microsoft Phi-4 LLM model a try”

Paper: A Comprehensive Evaluation of Quantization Strategies for Large Language Models

Yesterday I read the paper – A Comprehensive Evaluation of Quantization Strategies for Large Language Models https://arxiv.org/pdf/2402.16775.

Quantization reduces the memory footprint of models by representing weights and activations in lower-precision formats (e.g., 8-bit integers instead of 16- or 32-bit floats). The paper discusses two approaches:

  1. Quantization-Aware Training (QAT): Quantization is integrated into the training phase, helping models adapt to lower precision.
  2. Post-Training Quantization (PTQ): Quantization is applied after training. PTQ is more common because it avoids the high retraining costs of QAT, despite potential performance trade-offs.

Some key takeaways:

  • 4-bit quantization strikes a good balance between performance and efficiency.
  • Performance drops become noticeable at 3 bits or fewer, and at 2 bits, models often fail to produce coherent outputs.
  • Perplexity serves as a reliable benchmark for evaluating quantized models, showing that 8-bit models closely match their full-precision counterparts, with acceptable trade-offs at 4 bits.

While quantization saves memory, it can reduce inference speed, so it is best suited to scenarios where memory is constrained but speed is less critical.
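As a concrete illustration of PTQ (my own sketch, not from the paper), this is roughly what loading an already-trained model in 4-bit looks like with the Hugging Face transformers and bitsandbytes libraries:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/phi-4"  # any causal LM on the Hub works here

# Post-training quantization: weights are quantized at load time, with no
# retraining. NF4 is a commonly used 4-bit format.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

inputs = tokenizer("Quantization reduces memory usage because", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=30)[0], skip_special_tokens=True))
```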

This paper offers practical insights for anyone optimizing LLMs for real-world applications.

Chat-Driven Development: A Better Way to Use LLMs for Coding

In the last couple of years, I have subscribed to GitHub Copilot multiple times and canceled the subscription each time. It never felt natural to me; it was annoying and got in my way too much. To me, chat-driven development with ChatGPT or Claude feels more natural. I feel more in control because I can be explicit about what I want, work on smaller problems, and pass only the relevant context. This helps LLMs generate better code for me.

Today, I was reading a blog where the author shared similar experiences. They listed the following reasons:

  1. Clean slate advantage: The author mentions that IDE workspaces are often messy, repositories too large, and full of distractions. LLMs can get confused with too much complexity and ambiguity. Using chat through a web browser provides a blank slate for well-contained requests.
  2. Control over context: Chat allows carefully crafting specific, exam-style questions with just the necessary background material, leading to better results than overwhelming the LLM with the entire IDE context.
  3. Energy management: The author uses chat-driven programming when they know what needs to be written but lack the energy to start from scratch (especially after 11 AM or during context switches between languages/frameworks). Getting a first draft from an LLM with dependencies and structure makes it easier to fix mistakes than starting from zero.
  4. Protection of development environment: The author explicitly states “I do not want an LLM spewing its first draft all over my current branch.” They prefer keeping LLM experimentation separate from their main development environment.
  5. Feedback cycle: The author finds that using chat makes it easier to iterate on code by simply pasting compiler errors or test failures back to the LLM for corrections, without cluttering the IDE workspace.

Along with the above, there are a few more reasons why I didn’t like LLM IDE integrations:

  1. For me, LLMs are a thinking tool. I use them to think through possible designs and table structures, and to brainstorm. IDE-based LLM integrations do not encourage that kind of thinking; their only incentive is to generate code.
  2. I want to reach for help when I need it, rather than having the tool constantly tell me it can help. This inversion of dependency just doesn’t work for me. ChatGPT requires explicit intent and questions, whereas GitHub Copilot relies on implicit intent.
  3. I prefer editors to be lightweight without any baggage. Fast, basic code completion is what I need from my IDEs.

Reasons inference scaling models like OpenAI o1 might not benefit Agents

I was surprised when I read that the OpenAI o1 model does not seem to benefit agents. If a model can think and reason better, then it should help agents make better decisions and be more reliable. In the CORE-Bench benchmark, o1-mini scored 24% compared to 38% for Claude 3.5 Sonnet. The article cites two main reasons:

  • Inference-scaling models like o1 require different prompting styles than regular models, and current agentic systems are optimized for prompting regular models.
  • Reasoning or inference-scaling models have so far not been trained with reinforcement learning in a setting where they receive feedback from the environment, be it code execution, shell interaction, or web search. In other words, their tool-use ability is no better than that of the underlying model before it learned to reason.

Another reason I came across is that o1 can behave strangely when integrated into custom agentic architectures, since it was trained to follow a specific reasoning process. In fact, OpenAI actively discourages prompts that ask the model to reason in any particular way.

Paper: Don’t Expect Juniors to Teach Senior Professionals to Use Generative AI

Yesterday I was reading a Harvard Business School working paper titled Don’t Expect Juniors to Teach Senior Professionals to Use Generative AI: Emerging Technology Risks and Novice AI Risk Mitigation Tactics. The study was conducted with Boston Consulting Group, a global management consulting firm. In July and August 2023, the researchers interviewed 78 junior consultants who had recently participated in a field experiment that gave them access to generative AI (GPT-4) for a business problem-solving task.

The paper makes the point that with earlier technologies, it was junior professionals who helped senior professionals upskill. It cites multiple reasons why junior professionals are better able to learn and use new technology than their senior counterparts:

  • First, junior professionals are often closest to the work itself, because they are the ones engaging in concrete and less complex tasks.
  • Second, junior professionals may be more able to experiment with new technologies in real time, because they do not risk losing a mandate to lead if those around them, including clients and those more junior to them, recognize that they lack the practical expertise expected of their hierarchical position.
  • Third, junior professionals may be more willing to learn new methods that conflict with existing identities, practices, and frames.

Continue reading “Paper: Don’t Expect Juniors to Teach Senior Professionals to Use Generative AI”

Why LLM Vocabulary Size Matters

Large Language Models (LLMs) are at the forefront of AI advancements, capable of understanding and generating human-like text. Central to their operation is the concept of vocabulary—a predefined set of words or tokens that the model uses to interpret and generate text. But why does vocabulary size matter? This blog delves into the intricacies of LLM vocabulary, its impact on model performance, and how different approaches influence the effectiveness of these models.

What is LLM Vocabulary?

An LLM’s vocabulary is the set of tokens, ranging from single characters to whole words or phrases, that the model uses to represent input and output text. Tokenization, the process of converting text into these tokens, is a critical step that influences how efficiently a model can process and understand data.

For example, the word “tokenization” might be split into smaller subwords like “token” and “ization,” or it could be represented as a single token, depending on the model’s vocabulary size and tokenization strategy. This has profound implications for the model’s performance, resource requirements, and capabilities.
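To make this concrete, here is a quick check using OpenAI’s tiktoken library (my own example; the post does not tie itself to any particular tokenizer):

```python
import tiktoken

# cl100k_base is the roughly 100k-token vocabulary used by GPT-4-era models.
enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("tokenization")
print(ids)                              # integer token ids
print([enc.decode([i]) for i in ids])   # the subword pieces those ids map to
```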

Continue reading “Why LLM Vocabulary Size Matters”

Giving OpenAI Predicted Output a Try

OpenAI recently released a new feature in their API called Predicted Outputs, designed to reduce the latency of model responses in scenarios where much of the response is already known in advance. A good use case for this is when making incremental changes to code or text. I am currently working on an LLM use case for a customer, building a JSON document generator for their low-code tool. The user interacts through a chat interface to incrementally build the JSON document. I decided to give Predicted Outputs a try to see how it performs with this use case.
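At the API level, the feature is exposed through a prediction parameter on chat completions: you pass the text you expect the model to largely reuse. Here is a rough sketch of how I tried it for the JSON use case (the document contents and field names are illustrative):

```python
from openai import OpenAI

client = OpenAI()

# The JSON document the user has built so far; most of it should come back
# unchanged, which is exactly the case Predicted Outputs speeds up.
current_doc = '{"title": "Order form", "fields": [{"name": "email", "type": "string"}]}'

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": f"Add a required integer field named 'quantity' and return the full JSON only:\n{current_doc}",
        }
    ],
    prediction={"type": "content", "content": current_doc},
)

print(response.choices[0].message.content)
```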

Continue reading “Giving OpenAI Predicted Output a Try”

Adding Web Search to a LLM Chat Application

A common request from the customers I have helped build custom ChatGPTs for is the ability to support web search. Users want to fetch relevant data from web searches directly within the chat interface. In this post, I will show you how to quickly add web search functionality to a chat application. We will use Gradio to build a quick chatbot.

Gradio is an open-source Python library that allows you to build interactive web interfaces for machine learning models with ease. It simplifies the process of creating user-friendly applications by providing pre-built components and a straightforward API, enabling developers to focus on the core functionality of their projects without worrying about the intricacies of web development.

Let’s get started.
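To give you an idea of where we are headed, here is a minimal sketch: a Gradio ChatInterface whose response function first runs a web search and passes the results to the model as context. The search_web helper below is a placeholder for whatever search API you use; the rest of the post fills in the details.

```python
import gradio as gr
from openai import OpenAI

client = OpenAI()

def search_web(query: str) -> str:
    """Placeholder for a real search API call (Bing, SerpAPI, DuckDuckGo, ...).
    It should return a few relevant text snippets for the query."""
    return "..."

def respond(message, history):
    # Fetch fresh context from the web, then let the model answer with it.
    context = search_web(message)
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer the user using these web search results:\n{context}"},
            {"role": "user", "content": message},
        ],
    )
    return completion.choices[0].message.content

# ChatInterface wires the respond function into a ready-made chat UI.
gr.ChatInterface(respond).launch()
```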

Continue reading “Adding Web Search to a LLM Chat Application”