I read the paper “How faithful are RAG models? Quantifying the tug-of-war between RAG and LLMs’ internal prior” today and want to share two important things I learnt from it. I find the paper useful because it helps in thinking about how to build RAG systems.
# 1. Impact of the answer-generation prompt on the response
The researchers investigated how different prompting techniques affect how well a large language model (LLM) uses the information retrieved by a Retrieval-Augmented Generation (RAG) system. The study compared three prompts: a “strict” prompt that told the model to follow the retrieved information exactly, a “loose” prompt that encouraged the model to use its own judgement given the context, and a baseline “standard” prompt.
As the paper puts it:

> We observe lower and steeper drops in RAG adherence with the loose vs strict prompts, suggesting that prompt wording plays a significant factor in controlling RAG adherence.
This suggests that the way you phrase a question can significantly affect how much the LLM relies on the provided information. The study also compared these prompts across different LLMs and found similar trends. Overall, the research highlights that carefully choosing how you prompt an LLM strongly influences which information it uses to answer your questions.
This also implies that for problems where you only want the retrieved knowledge to guide answer generation, you can rely on the standard or loose prompt formats. For example, I am building a learning tool for scrum masters and product owners. In that scenario I only want to use the retrieved knowledge for guidance, so the standard or loose prompt formats make sense.
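To make the three settings concrete, here is a minimal sketch of how such prompt variants could be built. The wordings below are my own illustrations in the spirit of the paper's strict/loose/standard settings, not the paper's verbatim prompts:

```python
def build_prompt(question: str, context: str, mode: str = "standard") -> str:
    """Assemble a RAG prompt in one of three illustrative styles.

    The instruction texts are hypothetical examples, not the paper's
    actual prompts.
    """
    instructions = {
        # Force the model to defer to the retrieved information.
        "strict": ("Answer using ONLY the information in the context below, "
                   "even if it contradicts what you believe."),
        # Let the model weigh the context against its own knowledge.
        "loose": ("Use the context below as guidance, but rely on your own "
                  "judgement if it seems wrong or incomplete."),
        # Plain baseline with no adherence instruction.
        "standard": "Answer the question. You may use the context below.",
    }
    return f"{instructions[mode]}\n\nContext:\n{context}\n\nQuestion: {question}"

print(build_prompt("Who won the 2022 World Cup?", "Argentina won the final.", "loose"))
```

For a guidance-oriented tool like the one I describe above, you would call `build_prompt(..., mode="loose")` or `mode="standard"` rather than `mode="strict"`.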
# 2. The likelihood of a model adhering to retrieved information in RAG settings changes with its confidence in its response without context
The second interesting point in the paper is the relationship between the model’s confidence in its answer without context and its use of the retrieved information. Imagine you ask a large language model a question, but it is not sure whether the answer it already has is the best one. New information, typically called context, is then provided to help it refine its response. The study shows that the model is less likely to consider this context if it was very confident in its initial answer.
As the model’s confidence in its response without context (its prior probability) increases, the likelihood that the model adheres to the retrieved information presented in context (the RAG preference rate) decreases. This inverse correlation indicates that the model is more likely to stick to its initial response when it is more confident in that answer. The relationship holds across different domain datasets and is influenced by the choice of prompting technique, such as strictly or loosely adhering to the retrieved information. This inverse correlation highlights the tension between the model’s pre-trained knowledge and the information provided in context.
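The inverse relationship can be pictured with a toy model. The paper reports a roughly linear drop in RAG preference as the prior probability rises; the slope below is made up purely for illustration:

```python
def rag_preference_rate(prior_prob: float, slope: float = 0.6) -> float:
    """Toy illustration of the paper's finding: the chance that the model
    adopts the retrieved answer falls roughly linearly as its prior
    confidence rises. The slope value is hypothetical, not from the paper.
    """
    return max(0.0, 1.0 - slope * prior_prob)

# Weak prior -> the model tends to follow the retrieved context.
print(rag_preference_rate(0.1))
# Strong prior -> the model tends to stick with its own answer.
print(rag_preference_rate(0.9))
```

The only point this sketch captures is the direction of the effect: higher prior confidence, lower adherence to the retrieved context.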
We can use the per-token log probabilities (logprobs) returned by the model to calculate this confidence score.
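Assuming we already have the per-token logprobs for the model's no-context answer (for example via the `logprobs` option of the OpenAI chat completions API), a minimal sketch of turning them into a confidence score:

```python
import math

def confidence_from_logprobs(token_logprobs: list[float]) -> float:
    """Joint probability of the generated answer: exp of the sum of the
    per-token log probabilities. Values near 1.0 mean the model was very
    confident in its no-context answer."""
    return math.exp(sum(token_logprobs))

def normalized_confidence(token_logprobs: list[float]) -> float:
    """Length-normalized variant (geometric mean of token probabilities),
    useful for comparing answers of different lengths."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# Example: three tokens the model was fairly sure about.
print(confidence_from_logprobs([-0.05, -0.10, -0.02]))  # ~0.84
print(normalized_confidence([-0.05, -0.10, -0.02]))
```

A score like this could then be logged alongside the RAG answer to flag cases where a confident prior might be overriding the retrieved context.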