Using Pydantic MCP Run Python as an Open Source Alternative to OpenAI Code Interpreter

In the last blog I discussed how I use OpenAI Code Interpreter to do RAG over data files (CSV, Excel, etc.). OpenAI Code Interpreter is a managed offering and it comes with some limitations, so I went looking for an open source alternative. I discovered the Pydantic team’s MCP Run Python package. It is an MCP server that allows agents to execute Python code in a secure, sandboxed environment. It uses Pyodide to run Python code in a JavaScript environment with Deno, isolating execution from the host system.
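
For a sense of how this fits together, here is a minimal sketch of wiring MCP Run Python into a Pydantic AI agent over stdio. It is based on the published MCP Run Python example, but treat the Deno flags, the mcp_servers parameter, and result.output as assumptions that may shift between pydantic-ai releases:

import asyncio

from pydantic_ai import Agent
from pydantic_ai.mcp import MCPServerStdio

# Launch MCP Run Python as a subprocess over stdio; Deno fetches the package from JSR
# and runs the generated Python code inside a Pyodide sandbox.
server = MCPServerStdio(
    "deno",
    args=[
        "run", "-N", "-R=node_modules", "-W=node_modules",
        "--node-modules-dir=auto",
        "jsr:@pydantic/mcp-run-python", "stdio",
    ],
)

agent = Agent("openai:gpt-4.1-mini", mcp_servers=[server])

async def main():
    # The agent can now call the Python execution tool exposed by the server.
    async with agent.run_mcp_servers():
        result = await agent.run("How many days are there between 2000-01-01 and 2025-03-18?")
    print(result.output)

asyncio.run(main())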

Continue reading “Using Pydantic MCP Run Python as an Open Source Alternative to OpenAI Code Interpreter”

Paper: AbsenceBench: Why Language Models Struggle to Detect Missing Information

While large language models (LLMs) have achieved remarkable capabilities in processing long contexts and locating specific information, a recent paper reveals a surprising blind spot: they struggle significantly when asked to identify what’s missing. The AbsenceBench paper, by researchers from the University of Chicago and Stanford University, exposes a fundamental limitation with far-reaching implications for how we evaluate and deploy these models.

The Problem: Detecting Absence vs. Presence

Large language models excel at the “Needle in a Haystack” (NIAH) test, where they successfully locate specific information buried within long documents. However, AbsenceBench introduces the inverse challenge: given an original document and a modified version with deliberately removed content, can models identify what’s missing?

The results are sobering. Even state-of-the-art models like Claude-3.7-Sonnet achieve only a 69.6% F1 score at a modest average context length of 5K tokens, a dramatic drop from their near-perfect scores on information retrieval tasks.

The Research Methodology

The researchers designed AbsenceBench across three distinct domains to test different types of missing information:

  • Numerical sequences: Mathematical progressions with specific numbers removed
  • Poetry: Excerpts from the Gutenberg Poetry Corpus with missing lines
  • GitHub pull requests: Code diffs with deliberately omitted lines

The complete dataset is available on Hugging Face and contains 4,302 instances across all domains with an average context length of 5K tokens. You can easily load the dataset for experimentation:

from datasets import load_dataset

# Load specific domain
dataset = load_dataset("harveyfin/AbsenceBench", "poetry")
# Or load other domains: "sequences", "github_prs"
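
Each instance pairs the original document with a modified copy that has some lines removed. As a rough sketch of pulling one poetry example (the split and field names below, such as original_context and modified_context, are assumptions for illustration; check the dataset card for the exact schema):

# Grab a single example and separate the full poem from the incomplete version.
# NOTE: the split name and field names are assumed; verify them against the dataset card.
example = dataset["test"][0]
original_context = example["original_context"]   # the complete poem
modified_context = example["modified_context"]   # the poem with some lines removed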

To demonstrate this limitation, below is a simple reproduction. You can view the complete implementation here.

from openai import OpenAI

client = OpenAI()

system_prompt = """You are helping a student practice memorizing poems.
The student will recite a poem, but they may have missed some lines.
Your task is to identify exactly which lines are missing from their recitation.
List only the missing lines, nothing else."""

# original_context and modified_context come from the dataset example loaded above.
user_message = f"""Here is the complete original poem:
{original_context}
Now, here is my recitation which may be missing some lines:
{modified_context}
What lines did I miss? Please list only the missing lines, nothing else."""

# Ask the model which lines were omitted; temperature=0 keeps the output deterministic.
response = client.responses.create(
    model="gpt-4.1-mini",
    instructions=system_prompt,
    input=user_message,
    temperature=0
)

print(response.output_text)
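
To score the model’s answer, a simple (and admittedly rough) approach is to compare the lines it returns against the gold set of removed lines; the counts reported below and an F1 score fall out of the same comparison. This is only a sketch, not the paper’s official evaluation code, and it assumes the gold missing lines are available on the example under a field like removed_lines:

# Rough line-level scoring of the model's answer against the gold removed lines.
# NOTE: "removed_lines" is an assumed field name; the paper's evaluation may differ.
gold_missing = {line.strip() for line in example["removed_lines"] if line.strip()}
predicted = {line.strip() for line in response.output_text.splitlines() if line.strip()}

true_positives = predicted & gold_missing
precision = len(true_positives) / len(predicted) if predicted else 0.0
recall = len(true_positives) / len(gold_missing) if gold_missing else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")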

Striking Performance Results

The experimental results reveal the extent of this limitation. In one test with 72 lines deliberately removed from a document:

gpt-4.1-mini performance:

  • Identified correctly: 37 missing lines
  • Failed to identify: 35 missing lines
  • False positives: 132 lines incorrectly flagged as missing

o4-mini performance:

  • Identified correctly: 37 missing lines
  • Failed to identify: 35 missing lines
  • False positives: 6 lines incorrectly flagged as missing

While o4-mini significantly reduced false positives, both models still missed nearly half of the actual omissions, demonstrating that this isn’t simply a problem solved by more advanced reasoning capabilities.

The Attention Mechanism Hypothesis

The research suggests this poor performance stems from a fundamental limitation: Transformer attention mechanisms cannot easily attend to “gaps” in documents since these absences don’t correspond to any specific keys that can be attended to.

This insight is crucial for understanding why the problem exists. Traditional Transformer architecture excels at finding and attending to relevant information present in the input. However, when information is missing, there’s literally nothing for the attention mechanism to focus on—creating a fundamental blind spot in the model’s processing capabilities.

The Placeholder Solution

The researchers discovered a remarkable workaround: inserting placeholders where content is missing dramatically improves performance by 35.7% on average. This finding supports their hypothesis about attention mechanisms struggling with “gaps.” When placeholders provide something concrete to attend to, models can better identify the missing content.

This suggests that the issue isn’t necessarily about understanding absence conceptually, but rather about the architectural limitations of how Transformers process information gaps.
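
To make the intervention concrete, here is a rough sketch of constructing the placeholder variant. It assumes the documents are line-based and that the omission points are known (which the benchmark controls, since it created the omissions); the exact placeholder text used in the paper may differ:

# Rebuild the incomplete document with an explicit marker at every omission point,
# so the attention mechanism has a concrete token to attend to.
# NOTE: assumes line-based text and known omissions; the paper's exact format may differ.
PLACEHOLDER = "[MISSING LINE]"

kept_lines = set(modified_context.splitlines())
with_placeholders = "\n".join(
    line if line in kept_lines else PLACEHOLDER
    for line in original_context.splitlines()
)

# The model is then shown with_placeholders instead of the raw incomplete version.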

Real-World Implications

This limitation has serious implications for several critical applications:

LLM-as-a-Judge Systems: When AI models evaluate content, essays, or responses, their inability to detect missing information could lead to inflated scores and missed deficiencies.

Legal and Regulatory Analysis: As mentioned in the original research context, regulatory intelligence systems that compare document versions need to reliably identify what has been removed or changed between iterations.

Misinformation Detection: Detecting misinformation often requires identifying what key information is conspicuously absent from a claim or report.

Academic and Content Evaluation: Grading systems that rely on LLMs may fail to penalize incomplete responses appropriately.

Quality Assurance: Any system using LLMs to verify completeness of documentation, procedures, or reports faces inherent limitations.

A Broader Pattern of AI Limitations

The paper’s findings illuminate what researchers call the “jagged frontier” of AI capabilities—where models can be superhuman at one task while failing unexpectedly at a closely related one. As noted in recent analysis, “Long context models have been getting increasingly good at passing ‘Needle in a Haystack’ tests recently, but what about a problem in the opposite direction?”

This pattern suggests that current evaluation methods may be missing critical failure modes. The stark contrast between NIAH performance and AbsenceBench results highlights how misleading current evaluations might be if they ignore absence detection.

Future Directions

The research opens several important avenues for improvement:

Architecture Innovation: Developing attention mechanisms specifically designed to handle information gaps and absences.

Evaluation Framework Enhancement: Incorporating absence detection as a standard evaluation criterion alongside traditional benchmarks.

Training Methodology: Exploring ways to train models explicitly on absence detection tasks.

Placeholder Strategies: Investigating optimal approaches for using placeholders to improve absence detection in practical applications.

Conclusion

AbsenceBench reveals a fundamental blind spot in current language models that goes beyond simple performance metrics. The inability to reliably detect missing information represents a core limitation that could undermine trust in AI systems across numerous high-stakes applications.

Bottom line: While LLMs excel at finding needles in haystacks, they struggle significantly when the needle was never there to begin with. This limitation suggests we need both architectural innovations and more comprehensive evaluation frameworks that test for what models can’t see, not just what they can find.

As we continue to deploy these systems in critical applications, understanding and addressing this absence detection limitation becomes not just an interesting research problem, but a necessity for building truly reliable AI systems. The research provides a crucial foundation for developing more robust and trustworthy language models that can handle the full spectrum of information processing tasks—including recognizing when something important is missing.


Use OpenAI Code Interpreter To RAG over user data

When building RAG systems, one common challenge is helping users query their own data. Users often come with a couple of Excel files, Word documents, or CSV files and want to ask questions like “Which department has the highest expenses?” or “What are the trends in our sales data?” Traditional RAG approaches struggle here because they’re designed for large, pre-processed knowledge bases, not for ad-hoc analysis of user-uploaded files.

I am a big fan of OpenAI’s Code Interpreter feature for solving exactly this problem. Code Interpreter allows models to write and run Python code in a sandboxed environment to solve tasks. It is available in all tiers of ChatGPT, so you may already have seen it in action. Last week, I used it to process a huge Excel file (50 sheets) and extract structured JSON from it. It first generated the code, then executed it against my Excel file, and finally gave me a link to download the JSON. The best part is that you can iterate on the solution step by step over the course of the conversation.

If you have data in Excel, CSV, or any other structured format like JSON or XML, then you can use the code interpreter tool to ask questions about the data. For me, this is a better way to do RAG over user data when the data is not huge. Unlike traditional RAG that requires preprocessing, embedding, and vector storage, the Code Interpreter approach lets users directly upload their files and start querying immediately.
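
For reference, here is roughly what this looks like with the Responses API and the code interpreter tool. Treat it as a hedged sketch rather than production code: the file purpose, the container options, and the model name are assumptions based on the current API documentation and may need adjusting for your SDK version:

from openai import OpenAI

client = OpenAI()

# Upload the user's file so the sandboxed container can read it.
# NOTE: the purpose value and container options are assumptions; check the current API reference.
data_file = client.files.create(file=open("expenses.csv", "rb"), purpose="user_data")

response = client.responses.create(
    model="gpt-4.1",
    tools=[{
        "type": "code_interpreter",
        "container": {"type": "auto", "file_ids": [data_file.id]},
    }],
    input="Which department has the highest expenses? Show the top five as a table.",
)

print(response.output_text)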

Continue reading “Use OpenAI Code Interpreter To RAG over user data”