I am always looking for practical, real-world papers that can help in my work. I provide AI and LLM-related consultancy to multiple clients, most of whom are in the financial domain. One of the first things I do in a consulting engagement is create test datasets that help me baseline and improve AI systems. This usually means spending time with business/domain folks and reading and analyzing a lot of data.

Today, I stumbled upon the paper Expect the Unexpected: FailSafe Long Context QA for Finance, which details how the authors created a realistic dataset specific to the financial domain. For each record in the dataset, they introduced query and context perturbations and evaluated how different models perform, benchmarking both reasoning and non-reasoning models. The paper covers two main aspects:
- Testing how well LLMs handle real-world variations in queries and document quality when processing financial information
- Focusing on long-context scenarios (like 10-K reports) where accuracy is crucial
Dataset Overview
The best part of the FailSafeQA dataset they released on HuggingFace is that it mirrors real-world interactions between users and question-answering systems (AI assistants). FailSafeQA helps evaluate LLM resilience against variations in human input within the financial sector, caused by varying domain expertise, query incompleteness, source irrelevance, and linguistic inaccuracies. The dataset is based on 10-K annual reports, and each record contains the following fields:
- tokens: Number of tokens in the context, ranging from 4,000 to 30,000. With today's long context windows, this is a common scenario.
- query: The actual user query
- answer: The correct answer
- context: The long context that contains the correct answer
- error_query: A misspelled version of the query, which should still yield the same answer despite the spelling errors
- incomplete_query: A partial, loosely formed query that mimics keyword-based search behavior
- out_of_domain_query: Questions posed by users without in-depth expertise, where the wording may differ from expert phrasing but should still yield the same answer if the intent is clear
- out_of_scope_query: Queries with no relevance to the input content
- ocr_context: A version of the context with simulated OCR errors
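To get a feel for the records, here is a minimal sketch that loads the dataset with the HuggingFace `datasets` library and prints the query variants of a single record. The dataset ID and split name are my assumptions; check the dataset card for the exact values.

```python
# Minimal sketch: load FailSafeQA and inspect one record.
# Assumption: the dataset is published as "Writer/FailSafeQA" with a "test" split;
# verify the exact ID and split on the HuggingFace dataset card.
from datasets import load_dataset

ds = load_dataset("Writer/FailSafeQA", split="test")
record = ds[0]

print("context tokens:", record["tokens"])

# Compare the baseline query with its perturbed variants.
for field in ["query", "error_query", "incomplete_query",
              "out_of_domain_query", "out_of_scope_query"]:
    print(f"{field}: {record[field][:120]}")
```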
I suggest looking at one of the records to get a better understanding of the dataset. The image below, taken from the paper, helps illustrate its structure.

Dataset Generation Process
They used LLMs to generate the dataset through a multi-step query generation phase:
- Generated multi-turn query and answer pairs
- Identified the best standalone query
- Rewrote the query to make clear, standalone questions
- Extracted citations from the full context using the LongCite-llama3.1-8b model
- Only kept questions where citations adequately support the query response
While they used multi-turn conversations in the generation process, the final dataset contains only standalone questions. The paper doesn't explicitly explain why multi-turn conversations were generated first, but doing so may have helped produce more natural and diverse questions.
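As a mental model, the generation phase can be read as a small pipeline. The skeleton below only shows how the steps chain together; every helper is a hypothetical placeholder for an LLM call, not the authors' code.

```python
# Schematic skeleton of the multi-step generation process described above.
# All helpers are hypothetical placeholders, not the authors' implementation.

def generate_multi_turn_qa(context: str) -> list[dict]:
    """Step 1: prompt an LLM for multi-turn Q&A pairs grounded in the context."""
    raise NotImplementedError("placeholder for an LLM call")

def pick_best_standalone(turns: list[dict]) -> dict:
    """Step 2: select the turn that works best as a standalone question."""
    raise NotImplementedError("placeholder for an LLM call")

def rewrite_as_standalone(turn: dict) -> tuple[str, str]:
    """Step 3: rewrite the selected turn into a clear, self-contained query and answer."""
    raise NotImplementedError("placeholder for an LLM call")

def extract_citations(query: str, answer: str, context: str) -> list[str]:
    """Step 4: extract supporting citations (the paper uses LongCite-llama3.1-8b)."""
    raise NotImplementedError("placeholder for a citation model call")

def citations_support_answer(citations: list[str], answer: str) -> bool:
    """Step 5: decide whether the citations adequately support the answer."""
    raise NotImplementedError("placeholder for a verification step")

def build_record(context: str) -> dict | None:
    turns = generate_multi_turn_qa(context)
    query, answer = rewrite_as_standalone(pick_best_standalone(turns))
    citations = extract_citations(query, answer, context)
    if not citations_support_answer(citations, answer):
        return None  # drop questions the context does not adequately support
    return {"query": query, "answer": answer, "context": context}
```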
Query and Context Perturbations
For query perturbations, they used Meta’s Llama 3.1 405B to generate three types:
- Misspelled queries
- Incomplete queries (mimicking keyword-based search behavior)
- Out-of-Domain queries
For context perturbations, they introduced:
- Missing context: Simply omitted the context
- OCR errors: Used an OCR error simulator with a 10% character error probability cap, chosen to balance readability and realistic error occurrence (see the toy simulator sketched after this list)
- Irrelevant context: Randomly paired queries with irrelevant contexts, manually verifying their irrelevance
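Purely for intuition, here is a toy character-level noise injector in the spirit of that setup. The paper uses a dedicated OCR error simulator, so treat this only as an illustration of what a 10% per-character error cap looks like, not as their method.

```python
import random
import string

def add_ocr_noise(text: str, char_error_prob: float = 0.10, seed: int = 42) -> str:
    """Toy OCR-style noise: with probability char_error_prob per character,
    substitute, delete, or insert a stray character."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if rng.random() >= char_error_prob:
            out.append(ch)  # most characters pass through untouched
            continue
        op = rng.choice(["substitute", "delete", "insert"])
        if op == "substitute":
            out.append(rng.choice(string.ascii_letters))
        elif op == "insert":
            out.append(ch)
            out.append(rng.choice(string.ascii_letters))
        # "delete": drop the character entirely
    return "".join(out)

print(add_ocr_noise("Total revenue increased 12% year over year."))
```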
Key Findings
The benchmark assessed models in two scenarios:
- Providing robust answers (Baseline Query, Misspelled Query, Incomplete Query, Out-of-Domain Query, OCRed Context)
- Declining to answer when justified (Missing Context, Irrelevant Context)
The results revealed that models are generally better at delivering appropriate answers than at refusing to answer when the context is insufficient.
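To make the two scenarios concrete, here is a simplified scoring sketch. It assumes each model response has already been judged with a binary verdict per case; the paper's own scoring is richer than that, so this is only meant to show how the aggregate scores separate the "should answer" cases from the "should refuse" cases.

```python
# Simplified illustration of the two aggregate checks (not the paper's exact metric).
# Assumes a binary verdict per case: True = acceptable answer for answerable cases,
# True = correct refusal for unanswerable cases.

ANSWERABLE = ["baseline_query", "misspelled_query", "incomplete_query",
              "out_of_domain_query", "ocr_context"]
UNANSWERABLE = ["missing_context", "irrelevant_context"]

def robustness(verdicts: dict[str, bool]) -> float:
    """Fraction of answerable perturbations the model still answered correctly."""
    return sum(verdicts[c] for c in ANSWERABLE) / len(ANSWERABLE)

def context_grounding(verdicts: dict[str, bool]) -> float:
    """Fraction of unanswerable cases where the model correctly declined to answer."""
    return sum(verdicts[c] for c in UNANSWERABLE) / len(UNANSWERABLE)

example = {
    "baseline_query": True, "misspelled_query": True, "incomplete_query": True,
    "out_of_domain_query": True, "ocr_context": False,
    "missing_context": False, "irrelevant_context": True,
}
print(robustness(example))        # 0.8
print(context_grounding(example)) # 0.5
```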
Robustness Results

- OpenAI’s o3-mini performed best, achieving a robustness score of 0.90
- Major performance drops occurred with OCR context
- Most models handled misspelled and incomplete (keyword-based) queries well
- Smaller models (Llama 8B, DeepSeek R1 Llama 8B, Phi-3) struggled most with OCR context
- All models showed some performance drop compared to their baseline
Context Grounding

One particularly concerning finding was that reasoning-focused models like DeepSeek-R1, DeepSeek-R1-Distill, and OpenAI o1/o3-mini (which are typically praised for their reasoning capabilities) fabricated information in 41% to 70% of test cases when they should have declined to answer. The absence of context posed the biggest challenge, with many models, especially reasoning models, generating answers even when no context was provided.
This study provides valuable insights for anyone working on implementing LLMs in financial applications, highlighting the importance of robust testing and the need to carefully consider how models handle imperfect inputs and situations where they should decline to answer.