Can Claude Single Call and Zero Shot Do What Devin Can’t Do?

I remain skeptical about using LLMs in an autonomous, agentic manner. In my experience, for software development tasks they are most useful in chat-driven development, where a human guides their behavior and works with them step by step. The Answer.ai team recently published a post about Devin, which is marketed as “a collaborative AI teammate built to help ambitious engineering teams achieve more,” sharing multiple tasks where the agent failed. According to the Answer.ai blog post, out of 20 tasks they gave Devin, it failed at 14, succeeded at 3, and 3 were inconclusive. These results are quite disappointing. They have shared the complete task list in the appendix for anyone to try.

Many of the tasks they mentioned seemed achievable using AI assistants like Claude or ChatGPT, so I decided to experiment with Claude to complete one of them. I’ve increasingly been using Claude for coding-related tasks and have a paid subscription. This experiment uses Claude 3.5 Sonnet.

Continue reading “Can Claude Single Call and Zero Shot Do What Devin Can’t Do?”

Only ChatGPT Search Got It Right

I am using an open-source library called Docling to extract text from PDF documents. It was developed by the IBM research team, and the library works surprisingly well for my PDF documents.

from pathlib import Path
from docling.document_converter import DocumentConverter

source = "document.pdf"  # path to the input PDF
converter = DocumentConverter()
result = converter.convert(source)  # parse the PDF (layout, tables, OCR where needed)
result.document.save_as_markdown(filename=Path("document.md"))  # write the extracted content as Markdown

The code above generated a good-looking Markdown document. It cleanly extracted tables from my PDF. I am still benchmarking it with multiple PDFs, but it has been a good first experience with Docling.

Its README.md mentions that it uses an OCR engine, but it does not specify which one. Before diving into the source code, I decided to see if any GenAI search solutions could find the answer for me.

Continue reading “Only ChatGPT Search Got It Right”

Paper: Using Large Language Models to Promote Health Equity

I enjoyed reading this paper from the NEJM AI journal. This short (3-page) paper takes a positive view of large language models (LLMs) and discusses three potential LLM use cases that can make healthcare more equitable. As the paper mentions, 85% of the articles and papers on the equity-related impacts of LLMs focus on the harm they can cause; only 15% focus on equity opportunities.

NEJM AI is a new journal on medical artificial intelligence and machine learning from the NEJM Group. The New England Journal of Medicine (NEJM), published by NEJM Group, is one of the oldest and most prestigious medical journals in the world.

In this paper, the authors discuss three LLM use cases:

  1. LLMs can improve the detection of bias.
  2. LLMs can create structured datasets relevant to health equity.
  3. LLMs can improve access to health information.

I won’t discuss the third use case because everyone understands the importance of improving access to health information. However, people often underestimate the effort required to build a useful and reliable AI assistant.

Having spent the last two years working with LLMs, I’ve frequently encountered discussions about their inherent biases from training data. This paper presents an innovative counter-perspective: using LLMs to detect human biases in healthcare settings. While traditional methods relied on simple keyword searches, LLMs can analyze clinical notes to identify subtle linguistic patterns, sentiment, and stereotypes that may indicate bias in patient care. For example, research has shown that doctors more frequently describe Black patients as ‘difficult’ compared to white patients. LLMs can help systematically identify these kinds of biased language patterns by treating medical texts as ‘artifacts’ of a biased healthcare ecosystem.

Similarly, if LLMs can be used to promote propaganda, they can also be powerful tools for detecting it. Their deep understanding of language patterns can be turned from a potential liability into an asset for identifying harmful biases.
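
As a rough illustration of the bias-detection idea (my own sketch, not the paper’s method), here is a minimal example that asks an LLM to flag potentially stigmatizing language in a synthetic clinical note. The model name and prompt are assumptions, and the OpenAI Python SDK is just one possible client:

from openai import OpenAI

client = OpenAI()

# A synthetic note; real clinical notes would require appropriate privacy safeguards.
note = "Patient is difficult and non-compliant; refused to follow discharge instructions."

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model; any capable chat model would do
    messages=[
        {
            "role": "system",
            "content": "You review clinical notes for potentially stigmatizing or biased language. "
            "List any such phrases and briefly explain why they may indicate bias.",
        },
        {"role": "user", "content": note},
    ],
)
print(response.choices[0].message.content)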

The second use case, creating structured datasets, is already happening. Features like structured output have made it much easier to extract data from unstructured documents. However, I still see people writing custom scripts and tools to do this. I believe that, just as the coding experience in chat applications has improved with features like Canvas and Artifacts, structured text extraction will also get better. The workflow will become more UI-driven rather than relying on Python code execution with a code interpreter.
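
As a purely illustrative sketch of this kind of structured extraction, the snippet below uses the OpenAI Python SDK’s structured-output support with a Pydantic model; the model name and the fields are assumptions, not something from the paper:

from pydantic import BaseModel
from openai import OpenAI

# Hypothetical fields one might extract for health-equity analyses.
class PatientRecord(BaseModel):
    age: int
    preferred_language: str
    housing_insecurity_mentioned: bool

client = OpenAI()

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",  # assumed model
    messages=[
        {"role": "system", "content": "Extract the requested fields from the clinical note."},
        {"role": "user", "content": "62-year-old patient, Spanish-speaking, recently evicted ..."},
    ],
    response_format=PatientRecord,
)
print(completion.choices[0].message.parsed)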

Can OpenAI o1 model analyze GitLab Postgres Schema?

In July 2022, I wrote a blog post on GitLab’s Postgres schema design. I analyzed the schema and documented some of the interesting patterns and lesser-known best practices. If memory serves, I spent close to 20 hours spread over a couple of weeks writing the post. It also managed to reach the front page of Hacker News.

I use large language models (LLMs) daily for my work. I primarily use the gpt-4o, gpt-4o-mini, and Claude 3.5 Sonnet models exposed via ChatGPT or Claude AI assistants. I tried the o1 model once when it was launched, but I couldn’t find the right use cases for it. At that time, I tried it for a code translation task, but it didn’t work well. So, I decided not to invest much effort in it.

OpenAI’s o1 series models are large language models trained with reinforcement learning to perform complex reasoning. These o1 models think before they answer, producing a long internal chain of thought before responding to the user.

Continue reading “Can OpenAI o1 model analyze GitLab Postgres Schema?”

PostgreSQL Enum Types with SQLModel and Alembic

While working on a product that uses FastAPI, SQLModel, Alembic, and PostgreSQL, I encountered a situation where I needed to add an enum column to an existing table. Since it took me some time to figure out the correct approach, I decided to document the process to help others who might face similar challenges.

Let’s start with a basic scenario. Assume you have a data model called Task as shown below:

import uuid
from datetime import datetime, timezone
from typing import Optional
from sqlmodel import SQLModel, Field

class Task(SQLModel, table=True):
    __tablename__ = "tasks"
    id: uuid.UUID = Field(default_factory=uuid.uuid4, primary_key=True)
    title: str = Field(default=None)
    description: str | None = Field(default=None)
    created_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))

Using Alembic, you can generate the initial migration script with these commands:

alembic revision --autogenerate -m "Created task table"
alembic upgrade head

Now, let’s say you want to add a status field that should be an enum with two values – OPEN and CLOSED. First, define the enum class:

import enum
class TaskStatus(str, enum.Enum):
    OPEN = "open"
    CLOSED = "closed"

Then, add the status field to the Task class:

class Task(SQLModel, table=True):
    __tablename__ = "tasks"
    id: uuid.UUID = Field(default_factory=uuid.uuid4, primary_key=True)
    title: str = Field(default=None)
    description: str | None = Field(default=None)
    created_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
    status: Optional[TaskStatus] = None

If you run the Alembic migration commands at this point, it will define the status column as text. However, if you want to create a proper PostgreSQL enum type instead of storing the data as text, you’ll need to follow these additional steps:

  1. Install the alembic-postgresql-enum library:

pip install alembic-postgresql-enum

or if you’re using Poetry:

poetry add alembic-postgresql-enum

  2. Add the library import to your Alembic env.py file:

import alembic_postgresql_enum

  3. Modify the status field declaration in your Task class to explicitly use the enum type:

from sqlmodel import SQLModel, Field, Enum, Column


class Task(SQLModel, table=True):
    __tablename__ = "tasks"
    id: uuid.UUID = Field(default_factory=uuid.uuid4, primary_key=True)
    title: str = Field(default=None)
    description: str | None = Field(default=None)
    created_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
    status: Optional[TaskStatus] = Field(default=None, sa_column=Column(Enum(TaskStatus)))

Now you can run the Alembic commands to create a new PostgreSQL type for TaskStatus and use it for the column type:

alembic revision --autogenerate -m "Added status column in tasks table"
alembic upgrade head
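
The autogenerated migration should now create the PostgreSQL enum type before adding the column. Roughly, the upgrade step looks something like the sketch below; the exact generated code depends on your Alembic and alembic-postgresql-enum versions, so treat it as illustrative:

# Sketch of the kind of upgrade step the autogeneration produces, not the exact file.
import sqlalchemy as sa
from alembic import op
from sqlalchemy.dialects import postgresql


def upgrade() -> None:
    sa.Enum("OPEN", "CLOSED", name="taskstatus").create(op.get_bind())
    op.add_column(
        "tasks",
        sa.Column(
            "status",
            postgresql.ENUM("OPEN", "CLOSED", name="taskstatus", create_type=False),
            nullable=True,
        ),
    )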

To verify that the enum type was created correctly, connect to your PostgreSQL instance using psql and run the \dT+ command:

taskdb=# \dT+
                                         List of data types
 Schema |    Name    | Internal name | Size | Elements |  Owner   | Access privileges | Description
--------+------------+---------------+------+----------+----------+-------------------+-------------
 public | taskstatus | taskstatus    | 4    | OPEN    +| postgres |                   |
        |            |               |      | CLOSED   |          |                   |
This approach ensures that your enum values are properly constrained at the database level, providing better data integrity than using a simple text field.
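
For completeness, here is a minimal sketch of using the new column from application code with the Task model and TaskStatus enum defined above; the connection string is a placeholder for your own database:

from sqlmodel import Session, create_engine

engine = create_engine("postgresql://postgres:postgres@localhost:5432/taskdb")  # placeholder URL

with Session(engine) as session:
    task = Task(title="Write the migration", status=TaskStatus.OPEN)
    session.add(task)
    session.commit()
    session.refresh(task)
    print(task.status)  # TaskStatus.OPEN, stored in the taskstatus enum column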

The One GitHub Copilot Feature I Use

A couple of days back, I posted that I prefer chat-driven development with ChatGPT or Claude over IDE-integrated LLM tools like GitHub Copilot. An old friend reached out and asked if there is any part of my development workflow where LLM IDE integration makes me more productive. It turns out there is one place where I still like to use GitHub Copilot with VS Code: writing Git commit messages after I have made changes. A good, clean Git history is important to me; it helps me understand why I made a change. I’m a lazy person, so I often end up writing poor commit messages.

Continue reading “The One GitHub Copilot Feature I Use”

Giving Microsoft Phi-4 LLM model a try

Microsoft has officially released the MIT-licensed Phi-4 model. It is available on Hugging Face: https://huggingface.co/microsoft/phi-4.

Phi-4 is a 14B-parameter, state-of-the-art open model built upon a blend of synthetic datasets, data from filtered public-domain websites, and acquired academic books and Q&A datasets.

I wanted to give it a try, so I used Ollama on runpod.io. You can follow the instructions here: https://docs.runpod.io/tutorials/pods/run-ollama.

I used the 4-bit quantized model on Ollama; you can also try the 8-bit and fp16 versions. As I mentioned in my last blog post, 4-bit quantization strikes a good balance between performance and efficiency. I also tried the 8-bit quantized model, but both performed the same for me.
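
If you prefer to hit the model from Python instead of the Ollama CLI, a minimal sketch with the ollama client package looks like this; the phi4 tag is an assumption, so check the Ollama library for the exact tags and quantization variants available:

# pip install ollama; assumes an Ollama server is running with a phi4 model pulled.
import ollama

response = ollama.chat(
    model="phi4",  # assumed tag; pick the quantization variant you want to test
    messages=[{"role": "user", "content": "Explain quantization in one paragraph."}],
)
print(response["message"]["content"])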

Continue reading “Giving Microsoft Phi-4 LLM model a try”

Paper: A Comprehensive Evaluation of Quantization Strategies for Large Language Models

Yesterday, I read the paper “A Comprehensive Evaluation of Quantization Strategies for Large Language Models” (https://arxiv.org/pdf/2402.16775).

Quantization reduces the memory footprint of models by representing weights and activations in lower-precision formats (e.g., 8-bit integers instead of 16- or 32-bit floats). The paper discusses two approaches:

  1. Quantization-Aware Training (QAT): Quantization is integrated into the training phase, helping models adapt to lower precision.
  2. Post-Training Quantization (PTQ): Quantization is applied after training. PTQ is more common because it avoids the high retraining costs of QAT, despite potential performance trade-offs.

Some key takeaways:

  • 4-bit quantization strikes a good balance between performance and efficiency.
  • Performance drops become noticeable at 3 bits or fewer, and at 2 bits, models often fail to produce coherent outputs.
  • Perplexity serves as a reliable benchmark for evaluating quantized models, showing that 8-bit models closely match their full-precision counterparts, with acceptable trade-offs at 4 bits.

While quantization saves memory, it can reduce inference speed, making it ideal for scenarios where memory is constrained but speed is less critical.
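
To make the memory numbers concrete, here is a quick back-of-the-envelope calculation for a hypothetical 14B-parameter model (weights only, ignoring activations and runtime overhead):

# Illustrative arithmetic only: approximate weight memory at different precisions.
params = 14e9

for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("int4", 4)]:
    gigabytes = params * bits / 8 / 1e9
    print(f"{name}: ~{gigabytes:.0f} GB")
# fp32: ~56 GB, fp16: ~28 GB, int8: ~14 GB, int4: ~7 GB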

This paper offers practical insights for anyone optimizing LLMs for real-world applications.

Chat-Driven Development: A Better Way to Use LLMs for Coding

In the last couple of years, I have subscribed to GitHub Copilot multiple times, each time canceling the subscription. It never felt natural to me; it was annoying and got in my way too much. Chat-driven development with ChatGPT or Claude feels more natural. I feel more in control because I can be explicit about what I want, work on smaller problems, and pass only the relevant context. This helps LLMs generate better code for me.

Today, I was reading a blog where the author shared similar experiences. They listed the following reasons:

  1. Clean slate advantage: The author mentions that IDE workspaces are often messy, repositories too large, and full of distractions. LLMs can get confused with too much complexity and ambiguity. Using chat through a web browser provides a blank slate for well-contained requests.
  2. Control over context: Chat allows carefully crafting specific, exam-style questions with just the necessary background material, leading to better results than overwhelming the LLM with the entire IDE context.
  3. Energy management: The author uses chat-driven programming when they know what needs to be written but lack the energy to start from scratch (especially after 11 AM or during context switches between languages/frameworks). Getting a first draft from an LLM with dependencies and structure makes it easier to fix mistakes than starting from zero.
  4. Protection of development environment: The author explicitly states “I do not want an LLM spewing its first draft all over my current branch.” They prefer keeping LLM experimentation separate from their main development environment.
  5. Feedback cycle: The author finds that using chat makes it easier to iterate on code by simply pasting compiler errors or test failures back to the LLM for corrections, without cluttering the IDE workspace.

Along with the above, there are a few more reasons why I didn’t like LLM IDE integrations:

  1. For me, LLMs are a thinking tool. I use them to think through possible designs and table structures and to brainstorm. IDE-based LLM integrations do not encourage that kind of thinking; their incentive is to generate code only.
  2. I want to reach for help when I need it rather than having the tool constantly tell me it can help. This inversion of dependency just doesn’t work for me. ChatGPT requires explicit intent and questions, whereas GitHub Copilot relies on implicit intent.
  3. I prefer editors to be lightweight, without any baggage. Fast, basic code completion is all I need from my IDE.

Reasons inference scaling models like OpenAI o1 might not benefit Agents

I was surprised when I read that the OpenAI o1 model does not seem to benefit agents. If a model can think and reason better, then it should help agents make better decisions and be more reliable. On the CORE-Bench benchmark, o1-mini scored 24% compared to 38% for Claude 3.5 Sonnet. The article cites two main reasons:

  • Inference-scaling models like o1 require different prompting styles than regular models, and current agentic systems are optimized for prompting regular models.
  • Reasoning or inference-scaling models have so far not been trained with reinforcement learning in a setting where they receive feedback from the environment, be it code execution, shell interaction, or web search. In other words, their tool-use ability is no better than that of the underlying model before it learned to reason.

Another reason I read is that o1 can sometimes behave strangely when integrated into custom agentic architectures, since it was trained to follow a specific reasoning process—indeed, OpenAI actively discourages users from using prompts that ask the model to reason in any specific way.