Notes from Gemini Embedding Paper

I was reading a paper by the Google DeepMind team on how they trained Gemini Embedding, a state-of-the-art, unified embedding model. This is the second paper I’ve read this month on training embedding models. Last week, I read about how the Jina embedding model was trained. The Jina embedding paper was thin and lacked details, so I didn’t write about it. This paper is full of insights, so I thought I’d write a short post sharing what I learned.

Gemini Embedding achieves state-of-the-art performance across MMTEB’s multilingual, English, and code benchmarks.

Gemini Embedding uses Matryoshka Representation Learning (MRL) so that a single model can produce embeddings of different sizes (768, 1536, 3072). During training, the model applies separate contrastive losses to nested sub-portions of the embedding vector, ensuring that both shorter and longer embeddings are well trained. This provides flexibility: smaller embeddings for efficiency, larger ones for accuracy, all from the same model.
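
To make the idea concrete, here is a minimal sketch of an MRL-style loss in PyTorch. This is my own illustration, not the paper’s implementation; the dimension list, temperature, and in-batch negative setup are assumptions.

import torch
import torch.nn.functional as F

def mrl_contrastive_loss(query_emb, target_emb, dims=(768, 1536, 3072), temperature=0.05):
    # Apply an in-batch contrastive loss to each nested prefix of the embedding,
    # so that truncated embeddings remain useful on their own.
    total = 0.0
    for d in dims:
        q = F.normalize(query_emb[:, :d], dim=-1)
        t = F.normalize(target_emb[:, :d], dim=-1)
        logits = q @ t.T / temperature        # similarity of every query to every in-batch target
        labels = torch.arange(q.size(0), device=logits.device)  # the matching target sits on the diagonal
        total = total + F.cross_entropy(logits, labels)
    return total / len(dims)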

They cite two main reasons why the Gemini Embedding model achieves state-of-the-art performance in benchmarks:

  • The Gemini Embedding model is initialized from the weights of the Gemini LLM backbone. They also note that several recent embedding models such as E5-Mistral, SFR-Mistral, BGE-ICL, and NV-Embed have been initialized from the Mistral-7B (Jiang et al., 2023) backbone and then further adapted as embedding models. The same is true for the jina-code-embeddings-0.5b and 1.5b models, as they are built on the Qwen2.5-Coder-0.5B and Qwen2.5-Coder-1.5B backbones.
  • The second reason they cite is high-quality datasets. These datasets are synthetically generated using Gemini LLM. They mention: “Leveraging Gemini’s diverse capabilities, we train Gemini Embedding on a comprehensive suite of embedding tasks. To construct a high-quality, heterogeneous training dataset, we employ Gemini for several critical data curation steps: filtering low-quality examples, determining relevant positive and negative passages for retrieval, and generating rich synthetic datasets. This curated dataset facilitates training with a contrastive learning objective, enabling Gemini Embedding to learn robust semantic representations.”

In the paper, they also mention that the Gemini embedding model is trained with a contrastive loss that pulls queries close to their correct targets while pushing away incorrect ones. Negatives are usually sampled from the same batch, and sometimes hard negatives are added to make learning more robust. Each example is also tagged with a task type, which conditions the model to learn embeddings useful across different domains like Q&A or fact-checking.
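
A rough sketch of such a loss with one mined hard negative per query is below; the shapes, temperature, and setup are my assumptions, not details from the paper.

import torch
import torch.nn.functional as F

def contrastive_loss_with_hard_negatives(q, pos, hard_neg, temperature=0.05):
    # q, pos, hard_neg: [batch, dim] query, positive-target, and hard-negative embeddings.
    q, pos, hard_neg = (F.normalize(x, dim=-1) for x in (q, pos, hard_neg))
    in_batch = q @ pos.T / temperature                          # other targets in the batch act as negatives
    hard = (q * hard_neg).sum(-1, keepdim=True) / temperature   # one mined hard negative per query
    logits = torch.cat([in_batch, hard], dim=1)
    labels = torch.arange(q.size(0), device=logits.device)      # the correct target sits on the diagonal
    return F.cross_entropy(logits, labels)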

Each training example also includes a task description such as "question answering" or "fact checking". This string tells the model what kind of relationship between the query and target it should focus on. In effect, it makes the embeddings task-aware, allowing a single embedding model to generalize across multiple use cases.
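
The paper does not spell out the template at this level of detail, so the formatting below is purely hypothetical, but it illustrates how a task string could be combined with the text before embedding.

def build_embedding_input(task: str, text: str) -> str:
    # Hypothetical template: the task description conditions the encoder on the
    # relationship it should capture (retrieval, fact checking, and so on).
    return f"task: {task} | {text}"

build_embedding_input("question answering", "Who proposed the theory of general relativity?")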

They also discuss that to train the model they used a two-stage process — Pre-finetuning and Finetuning.

  • Pre-finetuning: First, the model is “pre-finetuned” on a large number of potentially noisy (query, target) pairs, omitting the hard-negative term from the loss function. They found it beneficial to use a large batch size, as the primary objective is to adapt the parameters from autoregressive generation to encoding.
  • Finetuning: Next, the model is fine-tuned on a large mixture of task-specific datasets containing (query, target, hard negative target) triples. For this phase of training, they found it beneficial to use smaller batch sizes (e.g., less than 1024) and to limit each batch to a single dataset, as distinguishing a given positive target from in-batch targets from the same task provides greater signal than discerning (say) a retrieval target from a classification label.

Paper: AbsenceBench: Why Language Models Struggle to Detect Missing Information

While large language models (LLMs) have achieved remarkable capabilities in processing long contexts and locating specific information, a recent paper reveals a surprising blind spot: they struggle significantly when asked to identify what’s missing. The AbsenceBench paper, by researchers from the University of Chicago and Stanford University, exposes a fundamental limitation with far-reaching implications for how we evaluate and deploy these models.

The Problem: Detecting Absence vs. Presence

Large language models excel at the “Needle in a Haystack” (NIAH) test, where they successfully locate specific information buried within long documents. However, AbsenceBench introduces the inverse challenge: given an original document and a modified version with deliberately removed content, can models identify what’s missing?

The results are sobering. Even state-of-the-art models like Claude-3.7-Sonnet achieve only 69.6% F1-score with a modest average context length of 5K tokens. This represents a dramatic performance gap compared to their near-perfect performance on information retrieval tasks.

The Research Methodology

The researchers designed AbsenceBench across three distinct domains to test different types of missing information:

  • Numerical sequences: Mathematical progressions with specific numbers removed
  • Poetry: Excerpts from the Gutenberg Poetry Corpus with missing lines
  • GitHub pull requests: Code diffs with deliberately omitted lines

The complete dataset is available on Hugging Face and contains 4,302 instances across all domains with an average context length of 5K tokens. You can easily load the dataset for experimentation:

from datasets import load_dataset

# Load specific domain
dataset = load_dataset("harveyfin/AbsenceBench", "poetry")
# Or load other domains: "sequences", "github_prs"

To demonstrate this limitation, below is a simple reproduction. You can view the complete implementation here.

from openai import OpenAI

client = OpenAI()

system_prompt = """You are helping a student practice memorizing poems.
The student will recite a poem, but they may have missed some lines.
Your task is to identify exactly which lines are missing from their recitation.
List only the missing lines, nothing else."""

# original_context and modified_context hold one AbsenceBench instance:
# the full poem and the recitation with some lines removed
user_message = f"""Here is the complete original poem:
{original_context}
Now, here is my recitation which may be missing some lines:
{modified_context}
What lines did I miss? Please list only the missing lines, nothing else."""

response = client.responses.create(
    model="gpt-4.1-mini",
    instructions=system_prompt,
    input=user_message,
    temperature=0
)

print(response.output_text)  # the lines the model believes are missing

Striking Performance Results

The experimental results reveal the extent of this limitation. In one test with 72 lines deliberately removed from a document:

gpt-4.1-mini performance:

  • Identified correctly: 37 missing lines
  • Failed to identify: 35 missing lines
  • False positives: 132 lines incorrectly flagged as missing

o4-mini performance:

  • Identified correctly: 37 missing lines
  • Failed to identify: 35 missing lines
  • False positives: 6 lines incorrectly flagged as missing

While o4-mini significantly reduced false positives, both models still missed nearly half of the actual omissions, demonstrating that this isn’t simply a problem solved by more advanced reasoning capabilities.

The Attention Mechanism Hypothesis

The research suggests this poor performance stems from a fundamental limitation: Transformer attention mechanisms cannot easily attend to “gaps” in documents since these absences don’t correspond to any specific keys that can be attended to.

This insight is crucial for understanding why the problem exists. Traditional Transformer architecture excels at finding and attending to relevant information present in the input. However, when information is missing, there’s literally nothing for the attention mechanism to focus on—creating a fundamental blind spot in the model’s processing capabilities.

The Placeholder Solution

The researchers discovered a remarkable workaround: inserting placeholders where content is missing dramatically improves performance by 35.7% on average. This finding supports their hypothesis about attention mechanisms struggling with “gaps.” When placeholders provide something concrete to attend to, models can better identify the missing content.

This suggests that the issue isn’t necessarily about understanding absence conceptually, but rather about the architectural limitations of how Transformers process information gaps.
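
As a rough illustration of this setup, here is a hypothetical helper that builds the modified document either with the lines silently dropped or with an explicit marker in their place; the function and marker text are my own, not from the paper.

def build_modified_context(lines, removed_indices, placeholder=None):
    # With placeholder=None the removed lines simply disappear (the hard setting);
    # with a marker string, the gap becomes something attention can latch onto.
    kept = []
    for i, line in enumerate(lines):
        if i in removed_indices:
            if placeholder is not None:
                kept.append(placeholder)
        else:
            kept.append(line)
    return "\n".join(kept)

# e.g. build_modified_context(poem_lines, {3, 7}) vs
#      build_modified_context(poem_lines, {3, 7}, placeholder="[MISSING LINE]")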

Real-World Implications

This limitation has serious implications for several critical applications:

LLM-as-a-Judge Systems: When AI models evaluate content, essays, or responses, their inability to detect missing information could lead to inflated scores and missed deficiencies.

Legal and Regulatory Analysis: As mentioned in the original research context, regulatory intelligence systems that compare document versions need to reliably identify what has been removed or changed between iterations.

Misinformation Detection: Detecting misinformation often requires identifying what key information is conspicuously absent from a claim or report.

Academic and Content Evaluation: Grading systems that rely on LLMs may fail to penalize incomplete responses appropriately.

Quality Assurance: Any system using LLMs to verify completeness of documentation, procedures, or reports faces inherent limitations.

A Broader Pattern of AI Limitations

The paper’s findings illuminate what researchers call the “jagged frontier” of AI capabilities—where models can be superhuman at one task while failing unexpectedly at a closely related one. As noted in recent analysis, “Long context models have been getting increasingly good at passing ‘Needle in a Haystack’ tests recently, but what about a problem in the opposite direction?”

This pattern suggests that current evaluation methods may be missing critical failure modes. The stark contrast between NIAH performance and AbsenceBench results highlights how misleading current evaluations might be if they ignore absence detection.

Future Directions

The research opens several important avenues for improvement:

Architecture Innovation: Developing attention mechanisms specifically designed to handle information gaps and absences.

Evaluation Framework Enhancement: Incorporating absence detection as a standard evaluation criterion alongside traditional benchmarks.

Training Methodology: Exploring ways to train models explicitly on absence detection tasks.

Placeholder Strategies: Investigating optimal approaches for using placeholders to improve absence detection in practical applications.

Conclusion

AbsenceBench reveals a fundamental blind spot in current language models that goes beyond simple performance metrics. The inability to reliably detect missing information represents a core limitation that could undermine trust in AI systems across numerous high-stakes applications.

Bottom line: While LLMs excel at finding needles in haystacks, they struggle significantly when the needle was never there to begin with. This limitation suggests we need both architectural innovations and more comprehensive evaluation frameworks that test for what models can’t see, not just what they can find.

As we continue to deploy these systems in critical applications, understanding and addressing this absence detection limitation becomes not just an interesting research problem, but a necessity for building truly reliable AI systems. The research provides a crucial foundation for developing more robust and trustworthy language models that can handle the full spectrum of information processing tasks—including recognizing when something important is missing.


Paper: Don’t Expect Juniors to Teach Senior Professionals to Use Generative AI

Yesterday I was reading a working paper by Harvard Business School titled Don’t Expect Juniors to Teach Senior Professionals to Use Generative AI: Emerging Technology Risks and Novice AI Risk Mitigation Tactics. The study was conducted with Boston Consulting Group, a global management consulting firm. They interviewed 78 junior consultants in July-August 2023 who had recently participated in a field experiment that gave them access to generative AI (GPT-4) for a business problem solving task.

The paper makes the point that with earlier technologies it was junior professionals who helped senior professionals upskill with the new technology. It cites multiple reasons why junior professionals are better able to learn and use new technology than their senior counterparts:

  • First, junior professionals are often closest to the work itself, because they are the ones engaging in concrete and less complex tasks
  • Second, junior professionals may be more able to engage in real-time experimentation with new technologies because, unlike their senior counterparts, they do not risk losing their mandate to lead if those around them (clients, as well as those more junior to them) recognize that they lack the practical expertise to support their hierarchical position
  • Third, junior professionals may be more willing to learn new methods that conflict with existing identities, practices, and frames

Paper: How faithful are RAG models?

I read the paper How faithful are RAG models? Quantifying the tug-of-war between RAG and LLMs’ internal prior today and thought I’d share two important things I learnt from it. I find this paper useful as it helps in thinking about how to build RAG systems.

#1. Impact of answer generation prompt on response

Researchers investigated how different prompting techniques affect how well a large language model (LLM) uses information retrieved by a Retrieval-Augmented Generation (RAG) system. The study compared three prompts: a “strict” prompt that told the model to strictly follow the retrieved information, a “loose” prompt that encouraged the model to use its own judgement based on the context, and a “standard” prompt.

As mentioned in the paper

We observe lower and steeper drops in RAG adherence with the loose vs strict prompts, suggesting that prompt wording plays a significant factor in controlling RAG adherence.

This suggests that the way you ask the LLM a question can significantly impact how much it relies on the provided information. The study also looked at how these prompts affected different LLMs, finding similar trends across the board. Overall, the research highlights that carefully choosing how you prompt an LLM can have a big impact on the information it uses to answer your questions.

The above also implies that for problems where you only want to guide the LLM’s answer generation, you can rely on the standard or loose prompt formats. For example, I am building a learning tool for scrum masters and product owners. In this scenario I only want to use the retrieved knowledge for guidance, so the standard or loose prompt formats make sense.
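
For illustration, here is roughly what the strict and loose variants could look like; the wording below is mine, not copied from the paper.

strict_prompt = (
    "Answer using ONLY the information provided in the context below. "
    "If the context does not contain the answer, say you don't know.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

loose_prompt = (
    "Use the context below as guidance, but rely on your own judgement "
    "if it conflicts with what you already know.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)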

#2. The likelihood of a model adhering to retrieved information in RAG settings changes with the model’s confidence in its response without context

The second interesting point discussed in the paper is the relationship between the model’s confidence in its answer without context and its use of the retrieved information. Imagine you ask a large language model a question, but it is not sure whether the answer it already has is the best one. New information is then provided to help it refine its response; this information is typically called context. The study shows that the model is less likely to consider this context if it was very confident in its initial answer.

As the model’s confidence in its response without context (its prior probability) increases, the likelihood that the model adheres to the retrieved information presented in context (the RAG preference rate) decreases. This inverse correlation indicates that the model is more likely to stick to its initial response when it is more confident in its answer without considering the context. The relationship holds across different domain datasets and is influenced by the choice of prompting technique, such as strictly or loosely adhering to the retrieved information. This inverse correlation highlights the tension between the model’s pre-trained knowledge and the information provided in context.

We can use logprobs to estimate the model’s confidence in its answer without context.
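
Here is a minimal sketch using the OpenAI Chat Completions API; the model name and the token-probability averaging heuristic are my choices, not the paper’s exact method.

import math
from openai import OpenAI

client = OpenAI()

# Ask the question without any retrieved context and inspect the token logprobs.
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "In which year was the Eiffel Tower completed?"}],
    logprobs=True,
    max_tokens=10,
)

token_logprobs = [t.logprob for t in completion.choices[0].logprobs.content]
# Average token probability as a rough proxy for the model's confidence in its prior answer.
confidence = math.exp(sum(token_logprobs) / len(token_logprobs))
print(confidence)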

Paper Summary: How Complex Systems Fail

Today, I read the paper How Complex Systems Fail by Richard Cook. The paper was written in 2000. It is a short and easy-to-read paper written by a doctor. He writes about his learnings in the context of patient care, but most of them are applicable to software development as well.

In this post, I go over the points that I liked from the paper and write how I think they are applicable to software development.

The first point raised in the paper is

All complex systems are intrinsically hazardous systems.

With respect to the complex software most of us build, I don’t think hazardous is the right word. Luckily, no one will die if the software we write fails. I see it as: all complex systems are intrinsically important systems for an organisation. Most complex software has business and financial importance, so if the software fails, the organization will end up losing money and reputation. That can be catastrophic and might bring an organization to its knees. It is the business importance of these systems that drives the creation of defences against failure.

This takes us to the next point in the paper:

Complex systems are heavily and successfully defended against failure.

I don’t think most complex systems have successful defensive mechanisms in place from day one, but I agree that over time mechanisms are added to operate the application successfully. A few of the mechanisms we use in software are data backups, redundancy, monitoring, periodic health checks, training people, change approval processes, and many others. With time, an organization builds a series of shields that normally divert operations away from failure. Another key point relevant to software is that when we rebuild a system we should expect it to become stable only over time. Many times you deploy a new system and it fails because the mechanisms used for the old system are not sufficient to run the new one. Over time, the organization puts mechanisms in place to operate the new application successfully.

The next point mentioned in this paper is:

Catastrophe requires multiple failures – single point failures are not enough.

I disagree with the author that single points of failure are not enough to cause catastrophic failures. I think it depends on how the complex software is built. If your system is built in such a way that a fault in one service causes failures in every service that depends on it, then a single failure in the least important component can bring the entire application down. So, if you are building complex distributed applications, you should use circuit breakers, service discovery, retries, and similar patterns to build resilient applications.
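
To make one of these patterns concrete, here is a minimal sketch of a circuit breaker, not a production implementation; the thresholds are arbitrary.

import time

class CircuitBreaker:
    # After too many consecutive failures, stop calling the downstream service
    # for a cooldown period so a failing dependency cannot drag everything down.
    def __init__(self, max_failures=3, reset_after=30):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow a trial call
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()
            raise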

The next point raised by author is not obvious to most of us:

Complex systems contain changing mixtures of failures latent within them.

The author says that it is impossible for us to run complex systems without flaws. I have read a similar point somewhere: Amazon aims for five-nines availability, i.e. 99.999%. Five-nines, or 99.999%, availability means 5 minutes and 15 seconds or less of downtime in a year. They don’t try to go further because it becomes too costly beyond this point, and customers can’t tell the difference between their internet not working and the service being unavailable.
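
The downtime figure is easy to verify:

minutes_per_year = 365.25 * 24 * 60            # ≈ 525,960 minutes
allowed_downtime = minutes_per_year * (1 - 0.99999)
print(allowed_downtime)                        # ≈ 5.26 minutes, i.e. about 5 minutes 15 seconds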

A corollary of the above point is:

Complex systems run in degraded mode.

The system continues to function because it contains so many redundancies and because people can make it function, despite the presence of many flaws.

The next point in the paper is:

Catastrophe is always just around the corner.

Another way to put this is: failure is the norm, deal with it. If you are building distributed systems, you must be clear in your mind that different components will fail, and those failures might lead to complete failure of the system. So we should try to build systems that can handle individual component failures. The author writes that it is impossible to eliminate the potential for such catastrophic failure; the potential for such failure is always present by the system’s own nature.

Then, author goes on to write:

Post-accident attribution accident to a ‘root cause’ is fundamentally wrong.

I don’t agree entirely with this point. The author is saying that there is no single cause of failure; according to him, there are multiple contributors. I agree that multiple reasons would have led to the failure, but I don’t think root cause analysis is fundamentally wrong. I think root cause analysis is the starting point for understanding why things go wrong. In the software world, blameless postmortems are done to understand failures, with the intention not of finding which team or person is responsible but of learning from the failure and avoiding it in the future.

The next point brings human biases into picture:

Hindsight biases post-accident assessments of human performance.

It always looks obvious in hindsight that we should have avoided the failure, given all the events that happened. This is the most difficult point to follow in practice; I don’t know how we can overcome these biases. The author writes that hindsight bias remains the primary obstacle to accident investigation, especially when expert human performance is involved.

The next point mentioned in the paper is:

All practitioner actions are gambles.

This is an interesting point. If you have ever tried to bring a failed system back to a working state, you will agree with the author. In those uncertain times, we try different things to bring the system up, but at best they are guesses based on our previous learnings. The author writes: “That practitioner actions are gambles appears clear after accidents; in general, post hoc analysis regards these gambles as poor ones. But the converse: that successful outcomes are also the result of gambles; is not widely appreciated.”

I like the way author makes a point about human expertise:

Human expertise in complex systems is constantly changing.

I think most software organisations want to use the latest and greatest technology to build complex systems, but they don’t invest enough in developing expertise. Humans are the most critical part of any complex system, so we need to plan for their training and skill refinement as a function of the system itself. Failing to do this will lead to software failures.

As they say, change is the only constant. The next point addresses this:

Change introduces new forms of failure.

I will quote directly from the paper, as the author puts it eloquently:

The low rate of overt accidents in reliable systems may encourage changes, especially the use of new technology, to decrease the number of low consequence but high frequency failures. These changes maybe actually create opportunities for new, low frequency but high consequence failures. When new technologies are used to eliminate well understood system failures or to gain high precision performance they often introduce new pathways to large scale, catastrophic failures.

The last point mentioned in the paper sums it all:

Failure free operations require experience with failure.

As they say, failure is the world’s greatest teacher. When we encounter performance issues or hard-to-predict situations, we push the envelope of our understanding of the system. These are the situations that help us discover the new mechanisms we need to put in place to handle future failures.

Conclusion

I thoroughly enjoyed reading this paper. It has many lessons for software developers. If we become conscious of them, I think we can write more resilient software.

Paper Summary: Simple Testing Can Prevent Most Failures

Today evening, I decided to read the paper Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems.

This paper asks an important question

Why do widely used distributed data systems designed for high availability, like Cassandra, Redis, Hadoop, HBase, and HDFS, experience failures, and what can be done to increase their resiliency?

We have to answer this question keeping in mind that these systems are developed by some of the best software developers in the world following good software development practices and are intensely tested.

These days most of us are building distributed systems. We can apply the findings shared in this post to build systems that are more resilient to failure.

The paper shares:

Most of the catastrophic system failures are result of incorrect handling of non-fatal errors explicitly signalled in the software

These failures fall into three buckets: 1) empty error-handling blocks, or error blocks with just a log statement; 2) error handling that aborts the cluster on an overly general exception; and 3) error-handling code that contains expressions like “FIXME” or “TODO” in the comments.

Most developers are guilty of all three of the above. Developers are good at anticipating that something will go wrong, but they don’t know what to do when it does. I looked at the error-handling code in one of my projects and found the same behaviour: I had written TODO comments or caught general exceptions. These are considered bad practices, but most of us still end up doing them.

Overall, we found that the developers are good at anticipating possible errors. In all but one case, the errors were checked by the developers. The only case where developers did not check the error was an unchecked error system call return in Redis.
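
Here is what the three patterns called out above look like, sketched in Python; the helper and exception are hypothetical, not code from the studied systems.

class ReplicationError(Exception):
    pass

def replicate_block(block_id):
    # Hypothetical operation that explicitly signals a non-fatal error.
    raise ReplicationError(f"replica for {block_id} unreachable")

# 1) Swallowed error: the failure is noticed but nothing is done about it.
try:
    replicate_block("blk_001")
except ReplicationError:
    pass  # or, at best, a lone log statement

# 2) Overly general handler that takes drastic action on any exception.
try:
    replicate_block("blk_002")
except Exception:
    print("shutting down the whole cluster")  # stand-in for an abort

# 3) A handler the developer knows is unfinished.
try:
    replicate_block("blk_003")
except ReplicationError:
    # TODO: retry on another replica?
    pass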

Another important point mentioned in the paper is

We found that 74% of the failures are deterministic in that they are guaranteed to manifest with an appropriate input sequence, that almost all failures are guaranteed to manifest on no more than three nodes, and that 77% of the failures can be reproduced by a unit test.

Most popular open source projects use unit testing, so it could be surprising that the existing tests were not good enough to catch these bugs. Part of this has to do with the fact that these bugs or failure situations happen only when a specific sequence of events occurs. The good part is that the sequence is deterministic. As a software developer, I can relate to the fact that most of us are not good at thinking through all the permutations and combinations, so even though we write unit tests, they do not cover all scenarios. I think code coverage tools and mutation testing can help here.

It is now universally agreed that unit testing helps reduce bugs in software. Over the last few years, I have worked with a few big enterprises, and I can attest that most of their code didn’t have unit tests, and even where parts of the code had unit tests, those tests were useless. So, even though the open source projects we use are getting better through unit testing, most of the code that an average developer writes has a long way to go. One thing we can learn from this paper is to start writing high-quality tests.

The paper mentions specific events where most of the bugs happen. Some of these events are:

  1. Starting up services
  2. Unreachable nodes
  3. Configuration changes
  4. Adding a node

If you are building a distributed application, you can try to test your application for these events. If you are building applications that use a microservices-based architecture, these are interesting events for your application as well. For example, how does your system behave if you call a service that is not available?
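
As a sketch, here is a hypothetical client with a fallback, plus a pytest-style test that simulates an unreachable service; the service name, URL, and fallback behaviour are invented for illustration.

import requests

def fetch_recommendations(user_id, base_url="http://recommendations.internal"):
    # Hypothetical client: degrade gracefully when the service is unreachable.
    try:
        resp = requests.get(f"{base_url}/users/{user_id}", timeout=1)
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        return []  # fallback: no recommendations rather than a crash

def test_unreachable_service_returns_fallback():
    # Point the client at a port nothing listens on to simulate an unreachable node.
    assert fetch_recommendations(42, base_url="http://127.0.0.1:9") == []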

As per the paper, these mature open-source systems have mature logging.

76% of the failures print explicit failure related messages.

The paper mentions three reasons why that is the case:

  1. First, since distributed systems are more complex, and harder to debug, developers likely pay more attention to logging.
  2. Second, the horizontal scalability of these systems makes the performance overhead of outputting log messages less critical.
  3. Third, communicating through message-passing provides natural points to log messages; for example, if two nodes cannot communicate with each other because of a network problem, both have the opportunity to log the error.

The authors of the paper built a static analysis tool called Aspirator to locate these bug patterns.

If Aspirator had been used and the captured bugs fixed, 33% of the Cassandra, HBase, HDFS, and MapReduce’s catastrophic failures we studied could have been prevented.

Overall, I enjoyed reading this paper. I found it easy to read and applicable to all software developers.