For the past year, I’ve been building applications using Large Language Models (LLMs). A common task I’ve used LLMs for is extracting structured information from unstructured text.
Imagine you’re an IT service provider like Accenture with a set of customer case studies. You want to extract specific information from each case study, such as: customer name, industry, the business problem addressed, technologies used in the solution, whether it’s a cloud-native solution or not, if it utilizes AI, and the project duration.
Take this case study for example from Accenture website https://www.accenture.com/in-en/case-studiesnew/cloud/axa-cloud we want to extract the information into a Python class shown below.
from pydantic import BaseModel
class CaseStudy(BaseModel):
client_name: str
industry: str
country: str
business_problem: str
technologies: list[str]
is_cloud_native_solution: bool
uses_ai: bool
project_duration: int
If you have been following LLMs you know that most closed source(even some open source) LLMs support structured outputs. For example, OpenAI has JSON mode that always return JSON object that make sense for your use case.
Let’s use OpenAI JSON mode to extract the relevant information.
Read more: Extracting structured output with instructor LLM library
Continue reading “Extracting structured output with instructor LLM library”