I am always looking for practical, real-world papers that can help in my work. I provide AI and LLM-related consultancy to multiple clients, most of whom are in the financial domain. One of the first things I do in a consulting engagement is create test datasets that help me baseline and improve AI systems. This usually means spending time with business/domain folks and reading and analyzing a lot of data.

Today, I stumbled upon the paper Expect the Unexpected: FailSafe Long Context QA for Finance, which details how the authors created a realistic dataset specific to the financial domain. For each record in the dataset, they introduced query and context perturbations and evaluated how different models perform, benchmarking both reasoning and non-reasoning models. The paper covers two main aspects:
- Testing how well LLMs handle real-world variations in queries and document quality when processing financial information
- Focusing on long-context scenarios (like 10-K reports) where accuracy is crucial
Dataset Overview
The best part of the FailSafeQA dataset they released on HuggingFace is that it mirrors real-world interactions between users and question-answering systems (AI assistants). FailSafeQA helps evaluate LLM resilience against variations in human input within the financial sector, caused by varying domain expertise, query incompleteness, source irrelevance, and linguistic inaccuracies. The dataset is based on 10-K annual reports, and each record contains the following fields:
- tokens: Number of tokens in the context, ranging from 4,000 to 30,000. With today's long context windows, this is a common scenario.
- query: The actual user query
- answer: The correct answer
- context: The long context that contains the correct answer
- error_query: A misspelled version of the query, which should still yield the same answer despite the spelling errors
- incomplete_query: A partial, loosely formed query that mimics keyword-based search behavior
- out_of_domain_query: Questions posed by users without in-depth expertise, where the wording may differ from expert phrasing but should still yield the same answer if the intent is clear
- out_of_scope_query: Queries with no relevance to the input content
- ocr_context: A version of the context with simulated OCR errors
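To get a feel for the records, here is a minimal sketch that loads the dataset with the HuggingFace `datasets` library and prints the query variants of a single record. The dataset ID and split name are my assumptions; check the dataset card for the exact values.

```python
# Minimal sketch: load FailSafeQA and inspect one record.
# Assumption: the dataset is published as "Writer/FailSafeQA" with a "test" split;
# verify the exact ID and split on the HuggingFace dataset card.
from datasets import load_dataset

ds = load_dataset("Writer/FailSafeQA", split="test")
record = ds[0]

print("context tokens:", record["tokens"])

# Compare the baseline query with its perturbed variants.
for field in ["query", "error_query", "incomplete_query",
              "out_of_domain_query", "out_of_scope_query"]:
    print(f"{field}: {record[field][:120]}")
```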
I suggest looking at one of the records to get a better understanding of the dataset. The image below, taken from the paper, helps illustrate its structure.

Dataset Generation Process
They used LLMs to generate the dataset through a multi-step query generation phase:
- Generated multi-turn query and answer pairs
- Identified the best standalone query
- Rewrote the query to make clear, standalone questions
- Extracted citations from the full context using the LongCite-llama3.1-8b model
- Only kept questions where citations adequately support the query response
While they used multi-turn conversations in the generation process, the final dataset contains only standalone questions. The paper doesn't explicitly explain why multi-turn conversations were generated first, but doing so may have helped produce more natural and diverse questions.
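As a mental model, the generation phase can be read as a small pipeline. The skeleton below only shows how the steps chain together; every helper is a hypothetical placeholder for an LLM call, not the authors' code.

```python
# Schematic skeleton of the multi-step generation process described above.
# All helpers are hypothetical placeholders, not the authors' implementation.

def generate_multi_turn_qa(context: str) -> list[dict]:
    """Step 1: prompt an LLM for multi-turn Q&A pairs grounded in the context."""
    raise NotImplementedError("placeholder for an LLM call")

def pick_best_standalone(turns: list[dict]) -> dict:
    """Step 2: select the turn that works best as a standalone question."""
    raise NotImplementedError("placeholder for an LLM call")

def rewrite_as_standalone(turn: dict) -> tuple[str, str]:
    """Step 3: rewrite the selected turn into a clear, self-contained query and answer."""
    raise NotImplementedError("placeholder for an LLM call")

def extract_citations(query: str, answer: str, context: str) -> list[str]:
    """Step 4: extract supporting citations (the paper uses LongCite-llama3.1-8b)."""
    raise NotImplementedError("placeholder for a citation model call")

def citations_support_answer(citations: list[str], answer: str) -> bool:
    """Step 5: decide whether the citations adequately support the answer."""
    raise NotImplementedError("placeholder for a verification step")

def build_record(context: str) -> dict | None:
    turns = generate_multi_turn_qa(context)
    query, answer = rewrite_as_standalone(pick_best_standalone(turns))
    citations = extract_citations(query, answer, context)
    if not citations_support_answer(citations, answer):
        return None  # drop questions the context does not adequately support
    return {"query": query, "answer": answer, "context": context}
```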
Query and Context Perturbations
For query perturbations, they used Meta’s Llama 3.1 405B to generate three types:
- Misspelled queries
- Incomplete queries (mimicking keyword-based search behavior)
- Out-of-Domain queries
For context perturbations, they introduced:
- Missing context: Simply omitted the context
- OCR errors: Used an OCR error simulator with a 10% character error probability cap, chosen to balance readability and realistic error occurrence (see the toy simulator sketched after this list)
- Irrelevant context: Randomly paired queries with irrelevant contexts, manually verifying their irrelevance
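Purely for intuition, here is a toy character-level noise injector in the spirit of that setup. The paper uses a dedicated OCR error simulator, so treat this only as an illustration of what a 10% per-character error cap looks like, not as their method.

```python
import random
import string

def add_ocr_noise(text: str, char_error_prob: float = 0.10, seed: int = 42) -> str:
    """Toy OCR-style noise: with probability char_error_prob per character,
    substitute, delete, or insert a stray character."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if rng.random() >= char_error_prob:
            out.append(ch)  # most characters pass through untouched
            continue
        op = rng.choice(["substitute", "delete", "insert"])
        if op == "substitute":
            out.append(rng.choice(string.ascii_letters))
        elif op == "insert":
            out.append(ch)
            out.append(rng.choice(string.ascii_letters))
        # "delete": drop the character entirely
    return "".join(out)

print(add_ocr_noise("Total revenue increased 12% year over year."))
```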
Key Findings
The benchmark assessed models in two scenarios:
- Providing robust answers (Baseline Query, Misspelled Query, Incomplete Query, Out-of-Domain Query, OCRed Context)
- Declining to answer when justified (Missing Context, Irrelevant Context)
The results revealed that models are generally better at delivering appropriate answers than at refusing to answer when the context is insufficient.
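To make the two scenarios concrete, here is a simplified scoring sketch. It assumes each model response has already been judged with a binary verdict per case; the paper's own scoring is richer than that, so this is only meant to show how the aggregate scores separate the "should answer" cases from the "should refuse" cases.

```python
# Simplified illustration of the two aggregate checks (not the paper's exact metric).
# Assumes a binary verdict per case: True = acceptable answer for answerable cases,
# True = correct refusal for unanswerable cases.

ANSWERABLE = ["baseline_query", "misspelled_query", "incomplete_query",
              "out_of_domain_query", "ocr_context"]
UNANSWERABLE = ["missing_context", "irrelevant_context"]

def robustness(verdicts: dict[str, bool]) -> float:
    """Fraction of answerable perturbations the model still answered correctly."""
    return sum(verdicts[c] for c in ANSWERABLE) / len(ANSWERABLE)

def context_grounding(verdicts: dict[str, bool]) -> float:
    """Fraction of unanswerable cases where the model correctly declined to answer."""
    return sum(verdicts[c] for c in UNANSWERABLE) / len(UNANSWERABLE)

example = {
    "baseline_query": True, "misspelled_query": True, "incomplete_query": True,
    "out_of_domain_query": True, "ocr_context": False,
    "missing_context": False, "irrelevant_context": True,
}
print(robustness(example))        # 0.8
print(context_grounding(example)) # 0.5
```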
Robustness Results

- OpenAI’s o3-mini performed best, achieving a robustness score of 0.90
- Major performance drops occurred with OCR context
- Most models handled misspelled and incomplete (keyword-based) queries well
- Smaller models (Llama 8B, DeepSeek R1 Llama 8B, Phi-3) struggled most with OCR context
- All models showed some performance drop compared to their baseline
Context Grounding

One particularly concerning finding was that reasoning-focused models like DeepSeek-R1, DeepSeek-R1-Distill, and OpenAI o1/o3-mini (which are typically praised for their reasoning capabilities) fabricated information in 41% to 70% of test cases when they should have declined to answer. The absence of context posed the biggest challenge, with many models, especially reasoning models, generating answers even when no context was provided.
This study provides valuable insights for anyone working on implementing LLMs in financial applications, highlighting the importance of robust testing and the need to carefully consider how models handle imperfect inputs and situations where they should decline to answer.