I have spent the last few months working on regulatory intelligence software. One of its key features is extracting obligations from dense PDF documents. In this post I am sharing some of the lessons we’ve learned about architecting AI systems that work in production.
#1. Break complex tasks: List First, Analyze Later
One of our biggest breakthroughs came from realizing that obligation extraction isn’t a single-step process. Initially, we tried to extract complete, structured obligations in one pass, but this led to inconsistent results and missed obligations.
Our solution? A two-step approach that mirrors how human analysts work:
Step 1: Obligation Identification – Cast a wide net to find all potential obligation statements using trigger phrases like “shall”, “must”, “should”, and “is required to”. This agent prioritizes completeness over precision, ensuring we don’t miss anything.
async def identify_obligations(section_text):
    prompt = """
    Extract all obligation statements from this text.
    Look for trigger phrases: shall, must, should, is required to
    Return only the obligation statements as a list.
    """
    return await identification_agent.run(prompt + section_text)
Step 2: Detailed Analysis – Take each identified obligation and extract structured information: who is obligated, what they must do, under what conditions, and whether it’s a general requirement or regulatory power.
async def analyze_obligation(obligation_text, context):
    prompt = """
    Analyze this obligation and extract:
    - obligated_party: Who must comply
    - conditions: When/how it applies
    - is_general_requirement: Boolean
    - is_regulatory_power: Boolean
    """
    return await analysis_agent.run(prompt, obligation_text, context)
This separation of concerns dramatically improved our recall rate. The identification agent can focus purely on finding obligations without getting bogged down in complex structuring tasks.
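Wired together, the two agents form a small pipeline. Here is a minimal sketch of how the steps compose, assuming the identification step hands back a plain list of statement strings (in practice the agent output needs parsing, and the real signature carries more context than this):

async def extract_obligations(section_text, context):
    # Step 1: cast a wide net and list every candidate obligation statement.
    statements = await identify_obligations(section_text)

    # Step 2: turn each candidate into a structured record.
    results = []
    for statement in statements:
        results.append(await analyze_obligation(statement, context))
    return results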
#2. The Challenge of Context and Relevance
One of our first major hurdles was dealing with irrelevant sections. PDFs contain introductions, definitions, annexures, and pure tabular data that don’t contain actual obligations. Processing these sections wasted compute resources and introduced noise.
We solved this with a dedicated relevance-checking agent that filters out non-relevant sections before processing:
async def check_relevance(section):
    prompt = """
    Determine if this section contains regulatory obligations.
    EXCLUDE: introductions, definitions, annexures, tables
    Return: {"is_relevant": boolean, "explanation": string}
    """
    return await relevance_agent.run(prompt, section.title, section.text)
This simple addition reduced our processing time by 40% and improved overall accuracy by eliminating false positives from boilerplate text.
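In the pipeline, the relevance check runs first and short-circuits everything else. A rough sketch, reusing the extraction sketch above and assuming check_relevance hands back the returned JSON as a Python dict:

async def process_section(section, context):
    relevance = await check_relevance(section)
    if not relevance["is_relevant"]:
        # Introductions, definitions, annexures, and tables never reach the expensive steps.
        return []
    return await extract_obligations(section.text, context)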
#3. Work with domain experts
Building an AI system without domain expertise is like navigating without a compass. We spent considerable time working with compliance officers to understand:
- How they identify obligations in regulatory texts
- What constitutes a “complete” obligation statement
- The difference between general requirements and specific regulatory powers
- How they handle edge cases and ambiguous language
This collaboration shaped our prompt engineering significantly. For instance, we learned that “should” in regulatory contexts is typically mandatory, not advisory – a nuance that dramatically affects extraction accuracy.
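That nuance goes straight into the prompts. The wording below is illustrative rather than our exact production prompt, but it shows the kind of rule the identification agent carries:

IDENTIFICATION_RULES = """
Treat "should" as mandatory, not advisory: in this regulatory context it carries
the same weight as "shall" and "must". Never discard a statement just because it
uses softer phrasing.
"""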
#4. Testing in the Real World
Our testing strategy goes beyond unit tests. We created comprehensive integration tests that validate against known obligation sets from real regulatory documents:
async def test_obligation_extraction():
    # Test with real regulatory text
    expected_obligations = [
        "LFIs should seek to mitigate these risks...",
        "The LFI should monitor whether the cash-intensive business..."
    ]
    actual_obligations = await extract_obligations(test_section)

    # Generate Excel report for manual review
    create_comparison_report(expected_obligations, actual_obligations)

    assert ...
Each test case includes:
- Expected obligations with exact text matching
- Section context validation
- Status tracking (match, missing, new)
- Excel report generation for manual review
The Excel export functionality proved invaluable for stakeholder validation. Domain experts could quickly review results, mark false positives/negatives, and provide feedback that directly informed our model improvements.
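The report itself is deliberately simple. Here is a sketch of what create_comparison_report could look like, assuming pandas (with openpyxl as the Excel writer) and reducing matching to exact text comparison:

import pandas as pd

def create_comparison_report(expected, actual, path="obligation_report.xlsx"):
    rows = []
    for text in expected:
        status = "match" if text in actual else "missing"
        rows.append({"obligation": text, "status": status})
    for text in actual:
        if text not in expected:
            rows.append({"obligation": text, "status": "new"})
    # One row per obligation with its review status, ready for domain experts to annotate.
    pd.DataFrame(rows).to_excel(path, index=False)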
#5. Some problems can be solved by good UI/UX design
While the AI model gets the glory, the user interface often determines success or failure. We invested heavily in building intuitive UI components for:
- Side-by-side review: Users can read the PDF text and the extracted obligations next to each other
- Adding obligations: Simple forms with auto-completion and validation
- Updating obligations: In-line editing with change tracking
- Merging obligations: Visual diff tools to combine similar obligations
- Deleting obligations: Soft deletes with undo functionality
- Bulk operations: Excel import/export for large-scale reviews
The custom navigation UI deserves special mention. We built a hierarchical view that lets users quickly jump between document sections, filter by obligation type, and track review progress. This seemingly simple feature reduced review time from hours to minutes.
Practical Architecture Decisions
Agent-Based Design: We used the OpenAI Agents SDK to create specialized agents for different tasks. This modular approach made debugging easier and allowed us to optimize each agent independently:
class ObligationExtractor:
    def __init__(self, client, model):
        self.relevance_agent = Agent(name="Relevance Checker", ...)
        self.identification_agent = Agent(name="Obligation Identifier", ...)
        self.analysis_agent = Agent(name="Obligation Analyzer", ...)
Context Sharing: A shared context object ensures all agents have access to the same document context without repetitive API calls:
@dataclass
class ObligationContext:
    client: AsyncOpenAI
    model: str
    section: Section
    section_text: str
    obligation_statements: Optional[List[str]] = None
Async Processing: Given the volume of text we process, async/await patterns were essential for performance. Our extraction pipeline can process multiple sections concurrently.
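Concretely, sections are fanned out with asyncio.gather; a semaphore is one simple way to keep concurrent model calls within rate limits. A sketch, reusing process_section from above (the concurrency limit is an illustrative value):

import asyncio

async def process_document(sections, context, max_concurrency=5):
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(section):
        async with semaphore:
            return await process_section(section, context)

    # All sections run concurrently, bounded by the semaphore.
    results = await asyncio.gather(*(run_one(s) for s in sections))
    return [obligation for section_result in results for obligation in section_result]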
Error Handling: Production AI systems fail in creative ways. We built comprehensive error handling with detailed logging to understand and fix edge cases quickly.
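Even a thin retry-and-log wrapper around each agent call catches a surprising share of these failures. A minimal sketch, following the agent.run call style used in the snippets above; the retry policy and log fields are illustrative, not our exact setup:

import asyncio
import logging

logger = logging.getLogger("obligation_extractor")

async def run_with_retries(agent, *args, attempts=3, backoff=2.0):
    for attempt in range(1, attempts + 1):
        try:
            return await agent.run(*args)
        except Exception as exc:
            # Capture enough context to reproduce the failure from the logs.
            logger.warning("agent run failed (attempt %d/%d): %s", attempt, attempts, exc)
            if attempt == attempts:
                raise
            await asyncio.sleep(backoff * attempt)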
Key Takeaways for Technical Teams
- Start with domain experts: No amount of clever engineering can substitute for understanding the problem domain deeply.
- Design for iteration: Build systems that can easily incorporate feedback. Our structured output format made it simple to add new fields as requirements evolved.
- Test with real data: Synthetic test cases miss the complexity of real-world documents. Always validate against actual regulatory PDFs.
- Invest in tooling: Build excellent debugging, testing, and review tools. They’ll save you countless hours during development and maintenance.
- Plan for merge conflicts: When working with document extraction, you’ll inevitably face duplication issues. Design merge strategies from day one.
- UI matters as much as AI: The best AI system is useless if stakeholders can’t effectively review and validate results. Prioritize user experience equally with model performance.
Looking Forward
Building an AI-powered obligation extractor taught us that success lies not just in the AI model, but in the entire system surrounding it. The combination of thoughtful architecture, domain expertise integration, comprehensive testing, and intuitive user interfaces created a tool that actually works in production.
For teams embarking on similar projects, remember: the goal isn’t to replace human judgment but to augment it. Build systems that make experts more efficient, not obsolete. Focus on the full workflow, not just the AI component, and always prioritize the end-user experience.