Paper: A Comprehensive Evaluation of Quantization Strategies for Large Language Models

Yesterday I read the paper “A Comprehensive Evaluation of Quantization Strategies for Large Language Models” (https://arxiv.org/pdf/2402.16775).

Quantization reduces the memory footprint of models by representing weights and activations in lower-precision formats (e.g., 8-bit integers instead of 16- or 32-bit floats). The paper discusses two approaches:

  1. Quantization-Aware Training (QAT): Quantization is integrated into the training phase, helping models adapt to lower precision.
  2. Post-Training Quantization (PTQ): Quantization is applied after training. PTQ is more common because it avoids the high retraining costs of QAT, despite potential performance trade-offs.
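
To make this concrete, below is a minimal sketch of symmetric 8-bit post-training quantization using absmax scaling. This is my own illustration of the general idea, not the paper’s exact method:

import numpy as np

# Symmetric int8 quantization: map the largest weight magnitude to 127.
def quantize_int8(weights):
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)  # stand-in for a weight matrix
q, scale = quantize_int8(w)
print(np.abs(w - dequantize(q, scale)).max())  # worst-case rounding error (~scale/2)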

Some key takeaways:

  • 4-bit quantization strikes a good balance between performance and efficiency.
  • Performance drops become noticeable at 3 bits or fewer, and at 2 bits, models often fail to produce coherent outputs.
  • Perplexity serves as a reliable benchmark for evaluating quantized models, showing that 8-bit models closely match their full-precision counterparts, with acceptable trade-offs at 4 bits (see the quick sketch of the metric after this list).
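
For reference, perplexity is the exponential of the average negative log-likelihood per token, so lower is better. A minimal sketch, where the per-token NLL values are made up for illustration:

import math

# Perplexity = exp(mean negative log-likelihood per token).
nlls = [2.1, 1.8, 2.4, 2.0]  # hypothetical per-token NLLs from a language model
perplexity = math.exp(sum(nlls) / len(nlls))
print(perplexity)  # a well-quantized model should score close to the original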

While quantization saves memory, it can reduce inference speed, making it ideal for scenarios where memory is constrained but speed is less critical.

This paper offers practical insights for anyone optimizing LLMs for real-world applications.

Chat-Driven Development: A Better Way to Use LLMs for Coding

In the last couple of years, I have subscribed to GitHub Copilot multiple times, canceling the subscription each time. It never felt natural to me; it was annoying and got in my way too much. Chat-driven development with ChatGPT or Claude feels more natural: I feel more in control because I can be explicit about what I want, work on smaller problems, and pass in only the relevant context. This helps LLMs generate better code for me.

Today, I was reading a blog where the author shared similar experiences. They listed the following reasons:

  1. Clean slate advantage: The author mentions that IDE workspaces are often messy, repositories too large, and full of distractions. LLMs can get confused with too much complexity and ambiguity. Using chat through a web browser provides a blank slate for well-contained requests.
  2. Control over context: Chat allows carefully crafting specific, exam-style questions with just the necessary background material, leading to better results than overwhelming the LLM with the entire IDE context.
  3. Energy management: The author uses chat-driven programming when they know what needs to be written but lack the energy to start from scratch (especially after 11 AM or during context switches between languages/frameworks). Getting a first draft from an LLM with dependencies and structure makes it easier to fix mistakes than starting from zero.
  4. Protection of development environment: The author explicitly states “I do not want an LLM spewing its first draft all over my current branch.” They prefer keeping LLM experimentation separate from their main development environment.
  5. Feedback cycle: The author finds that using chat makes it easier to iterate on code by simply pasting compiler errors or test failures back to the LLM for corrections, without cluttering the IDE workspace.

Along with the above, there are a few more reasons why I didn’t like LLM IDE integrations:

  1. For me, LLMs are a thinking tool. I use them to think about possible designs, table structures, and do brainstorming. IDE-based LLM integrations do not promote such thinking processes. Their incentive is to generate code only.
  2. I want to reach for help when I need it, rather than having the tool constantly tell me it can help. This inversion of dependency just doesn’t work for me. ChatGPT requires explicit intent/questions, whereas GitHub Copilot relies on implicit intent.
  3. I prefer editors to be lightweight without any baggage. Fast, basic code completion is what I need from my IDEs.

Reasons inference-scaling models like OpenAI o1 might not benefit agents

I was surprised when I read that the OpenAI o1 model does not seem to benefit agents. If a model can think and reason better, then it should help agents make better decisions and be more reliable. Yet on the CORE-Bench benchmark, o1-mini scored 24%, compared to 38% for Claude 3.5 Sonnet. The article cites two main reasons:

  • Inference-scaling models like o1 require different prompting styles than regular models, and current agentic systems are optimized for prompting regular models.
  • Reasoning or inference-scaling models have so far not been trained with reinforcement learning in a setting where they receive feedback from the environment, be it code execution, shell interaction, or web search. In other words, their tool-use ability is no better than that of the underlying model before it learned to reason.

Another reason I read is that o1 can sometimes behave strangely when integrated into custom agentic architectures, since it was trained to follow a specific reasoning process. Indeed, OpenAI actively discourages prompts that ask the model to reason in any particular way.

Using a Tor proxy to bypass IP restrictions

Yesterday I was working on a personal project where I needed to download the transcript of a YouTube video. I used the unofficial youtube-transcript-api Python library, which can fetch the transcript of any YouTube video. It worked fine locally, but as soon as I deployed my app on a cloud VM, it started throwing an error.

The code to list transcripts for a video is shown below.

from youtube_transcript_api import YouTubeTranscriptApi

# List all transcripts (manual and auto-generated) available for the video
video_id = "eIho2S0ZahI"
transcripts = YouTubeTranscriptApi.list_transcripts(video_id)

The code throws the following error on the cloud VM:

Could not retrieve a transcript for the video https://www.youtube.com/watch?v=w8rYQ40C9xo! This is most likely caused by:

Subtitles are disabled for this video

If you are sure that the described cause is not responsible for this error and that a transcript should be retrievable, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues. Please add which version of youtube_transcript_api you are using and provide the information needed to replicate the error. Also make sure that there are no open issues which already describe your problem!

As described in the GitHub issue https://github.com/jdepoix/youtube-transcript-api/issues/303, the only way to overcome this is to use a proxy. Most residential proxies cost somewhere between $5 and $10 per GB of data transfer, so the cost depends on your usage. One of the commenters suggested using a Tor proxy instead. Tor provides anonymity by routing your internet traffic through a series of volunteer-operated servers, which can help you avoid being blocked by services like YouTube.

There is an open-source project, https://github.com/dperson/torproxy, that lets you run a Tor proxy in a Docker container.

You can run it as follows:

docker run -p 8118:8118 -p 9050:9050 -d dperson/torproxy

This runs a Tor SOCKS proxy on port 9050 (the container also exposes an HTTP proxy on port 8118).
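
Before wiring it into the app, you can sanity-check that traffic actually goes through Tor by comparing your apparent IP with and without the proxy. api.ipify.org is just one of many IP-echo services; SOCKS support in requests needs the PySocks extra (pip install requests[socks]):

import requests

proxies = {
    "http": "socks5://127.0.0.1:9050",
    "https": "socks5://127.0.0.1:9050",
}

print(requests.get("https://api.ipify.org").text)                   # your VM's real IP
print(requests.get("https://api.ipify.org", proxies=proxies).text)  # a Tor exit node's IP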

You can then change your code to use the proxy.

from youtube_transcript_api import YouTubeTranscriptApi

# Route the library's requests through the local Tor SOCKS proxy.
# (Use "socks5h://" instead to resolve DNS through Tor as well.)
proxies = {
    'http': "socks5://127.0.0.1:9050",
    'https': "socks5://127.0.0.1:9050",
}

video_id = "eIho2S0ZahI"
transcripts = YouTubeTranscriptApi.list_transcripts(video_id, proxies=proxies)

Paper: Don’t Expect Juniors to Teach Senior Professionals to Use Generative AI

Yesterday I was reading a Harvard Business School working paper titled Don’t Expect Juniors to Teach Senior Professionals to Use Generative AI: Emerging Technology Risks and Novice AI Risk Mitigation Tactics. The study was conducted with Boston Consulting Group, a global management consulting firm. In July and August 2023, the researchers interviewed 78 junior consultants who had recently participated in a field experiment that gave them access to generative AI (GPT-4) for a business problem-solving task.

The paper makes the point that with earlier technologies, it was junior professionals who helped senior professionals upskill. It cites multiple reasons why junior professionals are better able to learn and use new technology than their senior counterparts:

  • First, junior professionals are often closest to the work itself, because they are the ones engaging in concrete and less complex tasks.
  • Second, junior professionals may be more able to experiment with new technologies in real time, because they do not risk losing their mandate to lead if those around them, including clients and more junior colleagues, recognize that they lack the practical expertise to support their hierarchical position.
  • Third, junior professionals may be more willing to learn new methods that conflict with existing identities, practices, and frames.