Solving a Fifth Grade Puzzle with Latest Reasoning Models

My 10-year-old niece sent me a puzzle that she wanted me to solve. She has her general knowledge paper tomorrow and was unable to get the correct answer, so she asked if I could solve the puzzle. My wife and I both tried to solve it, and we got the same answer that my niece already had. The problem was that, according to her book, that answer was not correct. So, I thought I’d ask reasoning models to find the answer.

Below is the puzzle. We have to find the value for the question mark:

[Image: the number pattern puzzle]

It is a simple number pattern puzzle: you have to identify the pattern and use it to find the missing number. My wife, my niece, and I all got 9 as the answer. My niece’s book suggested 7 as the answer. We just couldn’t figure out how to get 7. To us, it looks like a printing mistake in her book.

Below is what each of the models produced. I uploaded the image and prompted them to “Solve the puzzle.”

Continue reading “Solving a Fifth Grade Puzzle with Latest Reasoning Models”

The Shape of Stories: Vonnegut-Inspired Analysis Tool with Claude 3.7 Sonnet

I was watching a video which introduced me to Kurt Vonnegut’s shapes of stories. Vonnegut believed that all stories have simple shapes. He proposed that stories can be categorized into basic shapes based on their emotional arcs. Below are eight example shapes he discussed in his work.

Continue reading “The Shape of Stories: Vonnegut-Inspired Analysis Tool with Claude 3.7 Sonnet”

Can Claude 3.7 Sonnet generate SVG illustration for Maha Kumbh?

Anthropic released Claude 3.7 Sonnet yesterday, and I am liking its vibes in my coding-related tasks. Claude 3.7 Sonnet is a reasoning model, like OpenAI o1/o3, DeepSeek R1, Grok 3, and Google Gemini 2.0 Thinking.

I asked Claude 3.7 Sonnet to generate an SVG visualization of Maha Kumbh. Kumbh Mela is an important Hindu pilgrimage, celebrated approximately every 6 or 12 years, correlated with the partial or full revolution of Jupiter. A ritual dip in the waters marks the festival.

I know we should use an image model, but I wanted to see how far we can go with SVG using current reasoning models.

Below is the prompt:

Generate an SVG illustration of the Maha Kumbh Mela, capturing its spiritual essence and vibrant atmosphere. The scene should include:

- Naga Sadhus (holy men) with ash-covered bodies and matted hair, some meditating and others engaging in rituals.
- Priests performing traditional Hindu ceremonies with sacred fire (yagna) and offering prayers.
- Devotees of various ages taking a holy dip in the river, expressing devotion and spiritual connection.
- People with folded hands praying to Lord Shiva, some with rudraksha beads and others offering milk or flowers.
- A grand backdrop of temple structures, flags, and colorful tents representing the spiritual congregation.
- Include iconic elements like tridents (trishul), damaru (drum), and Om symbols to reinforce the devotion to Lord Shiva.

Ensure the composition is detailed, with intricate line work and vibrant color contrasts to reflect the grandeur and sanctity of Maha Kumbh.

Below is the output of Claude 3.7 Sonnet with Extended Thinking. It thought for 4 seconds and generated the following.

I also tried Grok 3. I had to tell it multiple times to generate SVG code; it defaulted to an explanation with no code. After multiple tries, this is what it generated.

Lastly, I tried OpenAI o3-mini. It generated the following SVG.

I liked the Claude 3.7 Sonnet version. I think it captured a few things well, like the Shivling.

Paper: Expect the Unexpected: FailSafe Long Context QA for Finance

I am always looking for practical, real-world papers that can help in my work. I provide AI and LLM-related consultancy to multiple clients, most of whom are in the financial domain. One of the first things I do as part of a consulting engagement is create test datasets that help me baseline and improve AI systems. This usually requires spending time with business/domain folks and looking at/reading/analyzing a lot of data. Today, I stumbled upon a paper, Expect the Unexpected: FailSafe Long Context QA for Finance, that details how the authors created a realistic dataset specific to the financial domain. For each record in the dataset, they introduced query and context perturbations and evaluated how different models perform. They benchmarked both reasoning and non-reasoning models. The paper covers two main aspects:

  • Testing how well LLMs handle real-world variations in queries and document quality when processing financial information
  • Focusing on long-context scenarios (like 10-K reports) where accuracy is crucial
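To make the perturbation idea concrete, here is a toy illustration in Python of the two perturbation families the paper describes. This is my own sketch, not the paper’s code, and the example strings are made up.

import random

def perturb_query(query: str) -> str:
    # Mimic an incomplete or sloppy user question by dropping a word
    words = query.split()
    if len(words) > 3:
        words.pop(random.randrange(len(words)))
    return " ".join(words)

def perturb_context(context: str, drop_rate: float = 0.05) -> str:
    # Mimic OCR/extraction noise by randomly deleting characters
    return "".join(ch for ch in context if random.random() > drop_rate)

query = "What was the total revenue reported for fiscal year 2023?"
context = "The company reported total revenue growth of 12% in fiscal year 2023."
print(perturb_query(query))
print(perturb_context(context))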
Continue reading “Paper: Expect the Unexpected: FailSafe Long Context QA for Finance”

xAI Grok 3 is impressive

Yes, I remember that I posted a couple of days back that I was not impressed by Grok 3. Until then I didn’t have access to the model; I had only seen the launch video by Elon and his team. For some reason, I found the video so boring, scripted, and self-congratulatory that I decided to write off xAI Grok 3. Today, I saw on X that Grok 3 is freely available, so I decided to give it a try.

I am working on a problem where we have multiple Excel reports with different formats, and we want to read all the tables in these Excel files in a generic manner. An Excel sheet can contain one or more tables. I had tried two approaches before trying out the Grok 3 approach, and they worked to some extent.
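To give a flavour of the problem, here is a minimal sketch of generic table detection using openpyxl: it groups consecutive non-empty rows into candidate table blocks. This is just a rough starting point, not the Grok 3 approach described in the full post, and the file name is hypothetical.

from openpyxl import load_workbook

def find_table_blocks(sheet):
    # Group consecutive non-empty rows into candidate table blocks
    blocks, current = [], []
    for row in sheet.iter_rows(values_only=True):
        if any(cell is not None for cell in row):
            current.append(row)
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks

wb = load_workbook("report.xlsx", data_only=True)  # hypothetical file name
for name in wb.sheetnames:
    for i, block in enumerate(find_table_blocks(wb[name]), start=1):
        print(f"{name}: table {i} has {len(block)} rows")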

Continue reading “xAI Grok 3 is impressive”

My Thoughts on xAI Grok 3

Just wasted 40 minutes watching the Grok 3 launch video. Below is what I posted on the HN thread.

I don’t know, but I found the recording uninspiring. There was nothing new for me. We’ve all seen reasoning models by now—we know they work well for certain use cases. We’ve also seen “Deep Researchers,” so nothing new there either.

No matter what people say, they’re all just copying OpenAI. I’m not a huge fan of OpenAI, but I think they’re still the ones showing what can be done. Yes, xAI might have taken less time because of their huge cluster, but it’s not inspiring to me. Also, the dark room setup was depressing.

Do Nothing Script Generator

I learnt about do-nothing scripting this week. Do-nothing scripting is a way to structure manual workflows as interactive scripts that guide the user step by step without automating the process immediately. By encapsulating each step in a function, this method ensures that no steps are skipped, reduces cognitive load, and provides a structured way to transition toward full automation over time.

You can convert your static documents into do-nothing scripts. Instead of reading a document, you interact with a script that prompts you for each action, making it easier to maintain consistency and track progress. Eventually, you can replace these functions with automation.
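Here is a minimal do-nothing script sketch for a made-up onboarding workflow. Each step only prints instructions and waits for confirmation, so nothing is automated yet, but no step can be skipped, and each function is a natural seam for later automation.

def wait():
    input("Press Enter when done...")

def create_account(username):
    print(f"Ask IT to create an account for {username}.")
    wait()

def grant_repo_access(username):
    print(f"Add {username} to the team in the Git hosting tool.")
    wait()

def send_welcome_email(username):
    print(f"Send the onboarding checklist to {username}.")
    wait()

if __name__ == "__main__":
    user = input("New developer username: ")
    # Later, individual steps can be swapped out for real automation
    for step in (create_account, grant_repo_access, send_welcome_email):
        step(user)
    print("Onboarding workflow complete.")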

Continue reading “Do Nothing Script Generator”

Running Large Language Models at Scale

I was watching a talk by Dylan Patel from Semi Analysis where he covered important techniques AI labs are using to do inference at scale. I’ve shared my notes below and highly recommend checking out the full presentation.

There are two distinct phases in inference: prefill and decode. Prefill processes the initial prompt and is compute-intensive, while decode generates tokens iteratively and is memory bandwidth-intensive.
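The split is easier to see in code. Below is a toy sketch assuming a hypothetical model.forward(tokens, kv_cache) interface that returns the next token and the updated KV cache; real inference engines implement this with batched GPU kernels.

def prefill(model, prompt_tokens):
    # Compute-intensive: all prompt tokens are processed in one pass,
    # producing the key-value (KV) cache.
    next_token, kv_cache = model.forward(prompt_tokens, kv_cache=None)
    return next_token, kv_cache

def decode(model, first_token, kv_cache, max_new_tokens=64):
    # Memory-bandwidth-intensive: one token per step, each step re-reading
    # the entire KV cache (and all model weights).
    generated, token = [], first_token
    for _ in range(max_new_tokens):
        generated.append(token)
        token, kv_cache = model.forward([token], kv_cache=kv_cache)
    return generated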

Several key techniques have emerged for efficient inference:

  • Continuous batching is essential for cost-effective operations. Rather than processing requests individually, continuous batching allows multiple user requests to be processed together, dramatically reducing costs compared to batch size one. This is particularly important when handling asynchronous user requests arriving at different times.
  • Disaggregated prefill separates the compute-intensive prefill phase from the bandwidth-intensive decode phase. Major providers implement this by using different accelerators for each phase, helping mitigate “noisy neighbor” problems and maintaining consistent performance under varying loads.
  • Context caching is an emerging optimization that avoids recomputing the key-value (KV) cache for frequently used prompts. While this requires significant CPU memory or storage, it can substantially reduce costs for applications that repeatedly reference the same context, such as legal document analysis. Google first implemented this technique, and both OpenAI and Anthropic now implement it as well. A toy sketch of the idea follows this list.
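Below is a toy sketch of context (prefix) caching, reusing the hypothetical model.forward() interface from the sketch above. Provider implementations key on the tokenized prefix and keep the KV cache in CPU memory or fast storage; this shows only the idea, not any provider’s actual code.

kv_store = {}  # tokenized prefix -> cached KV

def prefill_with_cache(model, prompt_tokens):
    key = tuple(prompt_tokens)
    if key in kv_store:
        return kv_store[key]  # reuse instead of recomputing the prefill
    _, kv_cache = model.forward(prompt_tokens, kv_cache=None)
    kv_store[key] = kv_cache
    return kv_cache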

A critical challenge in large-scale deployments is managing “straggler” GPUs. As ByteDance discovered, even in high-end GPU clusters, individual chips can underperform due to the “Silicon Lottery” – natural variations in chip performance. In their case, a single underperforming GPU reduced cluster performance significantly, as training workloads are synchronous and are limited by the slowest component.

For organizations building inference infrastructure, managing memory bandwidth becomes a critical challenge. Running large models requires loading enormous numbers of parameters for each token generation, making memory bandwidth a key constraint for achieving desired tokens-per-second targets while serving multiple users.
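A back-of-envelope calculation shows why. At batch size 1, every generated token has to stream all model weights from memory, so memory bandwidth caps tokens per second. The numbers below are illustrative, not a benchmark of any specific GPU; batching amortizes the weight reads across users, which is why continuous batching matters so much.

params = 70e9          # hypothetical 70B-parameter model
bytes_per_param = 2    # FP16/BF16 weights
bandwidth = 3.0e12     # ~3 TB/s class HBM on a high-end accelerator
model_bytes = params * bytes_per_param     # 140 GB of weights per token
print(bandwidth / model_bytes)             # ~21 tokens/sec upper bound at batch size 1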

The infrastructure challenges compound when scaling to larger models, requiring careful consideration of hardware capabilities, batching strategies, and caching mechanisms to maintain performance and cost-effectiveness.

Reducing size of Docling PyTorch Docker image

For the last couple of days, I’ve been working on optimizing the Docker image size of a PDF processing microservice. The service uses Docling, an open-source library developed by IBM Research, which internally uses PyTorch. Docling can extract text from PDFs and various other document types. Here’s a simplified version of our FastAPI microservice that wraps Docling’s functionality.

import os
import shutil
from pathlib import Path
from docling.document_converter import DocumentConverter
from fastapi import FastAPI, UploadFile

app = FastAPI()
UPLOAD_DIR = "uploads"
os.makedirs(UPLOAD_DIR, exist_ok=True)
converter = DocumentConverter()

@app.post("/")
async def root(file: UploadFile):
    file_location = os.path.join(UPLOAD_DIR, file.filename)
    with open(file_location, "wb") as buffer:
        shutil.copyfileobj(file.file, buffer)
    result = converter.convert(Path(file_location))
    md = result.document.export_to_markdown()
    return {"filename": file.filename, "text": md}

The microservice workflow is straightforward:

  • Files are uploaded to the uploads directory
  • Docling converter processes the uploaded file and converts it to markdown
  • The markdown content is returned in the response

Here are the dependencies listed in requirements.txt:

fastapi==0.115.8
uvicorn==0.34.0
python-multipart==0.0.20
docling==2.18.0

You can test the service using this cURL command:

curl --request POST \
  --url http://localhost:8000/ \
  --header 'content-type: multipart/form-data' \
  --form file=@/Users/shekhargulati/Downloads/example.pdf

On the first request, Docling downloads the required model from HuggingFace and stores it locally. On my Intel Mac machine, the initial request for a 4-page PDF took 137 seconds, while subsequent requests took less than 5 seconds. For production environments, using a GPU-enabled machine is recommended for better performance.

The Docker Image Size Problem

Initially, building the Docker image with this basic Dockerfile resulted in a massive 9.74GB image:

FROM python:3.12-slim
RUN apt-get update \
    && apt-get install -y
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
docling-blog  v1  51d223c334ea   22 minutes ago   9.74GB

The large size is because PyTorch’s default pip installation includes CUDA packages and other GPU-related dependencies, which aren’t necessary for CPU-only deployments.

The Solution

To optimize the image size, modify the pip installation command to download only CPU-related packages using PyTorch’s CPU-specific package index. Here’s the optimized Dockerfile:

FROM python:3.12-slim
RUN apt-get update \
    && apt-get install -y \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY . .
RUN pip install --no-cache-dir -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cpu
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Building with this optimized Dockerfile reduces the image size significantly:

docling-blog v2 ac40f5cd0a01   4 hours ago     1.74GB

The key changes that enabled this optimization:

  1. Added --no-cache-dir to prevent pip from caching downloaded packages
  2. Used --extra-index-url https://download.pytorch.org/whl/cpu to specifically download CPU-only PyTorch packages
  3. Added rm -rf /var/lib/apt/lists/* to clean up apt cache

This optimization reduces the Docker image size by approximately 82%, making it more practical for deployment and distribution.