Extracting structured output with instructor LLM library


For the past year, I’ve been building applications using Large Language Models (LLMs). A common task I’ve used LLMs for is extracting structured information from unstructured text.

Imagine you’re an IT service provider like Accenture with a set of customer case studies. You want to extract specific information from each case study, such as: customer name, industry, the business problem addressed, technologies used in the solution, whether it’s a cloud-native solution or not, if it utilizes AI, and the project duration.

Take, for example, this case study from the Accenture website (https://www.accenture.com/in-en/case-studiesnew/cloud/axa-cloud). We want to extract the information into the Python class shown below.

from pydantic import BaseModel

class CaseStudy(BaseModel):
    client_name: str
    industry: str
    country: str
    business_problem: str
    technologies: list[str]
    is_cloud_native_solution: bool
    uses_ai: bool
    project_duration: int  # in months

If you have been following LLMs, you know that most closed-source (and even some open-source) LLMs support structured outputs. For example, OpenAI has a JSON mode that always returns a JSON object that makes sense for your use case.

Let’s use OpenAI JSON mode to extract the relevant information.


I am building a course on how to build production apps using LLMs. We will cover topics like prompt engineering, RAG, search, testing and evals, fine-tuning, feedback analysis, and agents. You can register now and get a 50% discount. Register using this form – https://forms.gle/twuVNs9SeHzMt8q68

OpenAI JSON Mode

We will use the official OpenAI Python library to extract the information. You can install it using pip install openai.
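The code below also relies on an extract_text helper that pulls the visible text from the case study page. Its implementation is outside the scope of this post; here is a minimal sketch using requests and BeautifulSoup (my choice of tools, any HTML-to-text approach works):

import requests
from bs4 import BeautifulSoup

def extract_text(url: str) -> str:
    # Fetch the page and return its visible text
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text(separator="\n", strip=True)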

from openai import OpenAI
import json

client = OpenAI()

page_text = extract_text('https://www.accenture.com/in-en/case-studiesnew/cloud/axa-cloud')  # helper sketched above

system_prompt = f"""You are an expert at extracting information from any web page text that can be passed other applications.
Take a deep break and think step by step about how to best accomplish this goal using the following steps.

# STEPS
- Read the whole text so that you fully understand it
- Extract the client name for whom we have done this work
- Extract the client industry for whom we have done this work
- Extract the client country mentioned in the input
- Summarize the business problem covering important points discussed in the text
- Extract list of technologies mentioned in the text
- Classify whether this is a cloud native solution or not. Cloud-native is the use of open source software as well as technologies such as containers, microservices and service mesh to develop and deploy scalable applications on cloud computing platforms
- Classify whether we used Artificial Intelligence in this project or not.
- Extract the project duration in months

# OUTPUT
- Output a JSON that captures all the information mentioned in the STEPS section

# INPUT

{page_text}
"""

response = client.chat.completions.create(
    model="gpt-3.5-turbo-0125",
    temperature=0.0,
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": system_prompt},
    ]
)

json_str = response.choices[0].message.content

print(json.loads(json_str))

Let’s look at important points in the above code.

  1. The most interesting part of the above code is the system prompt. As you can see, we have clearly defined how we want the model to extract the different attributes. We have also used the Zero-Shot Chain-of-Thought prompt engineering technique to help the model generate reasoning chains.

A system prompt is the initial input or instruction given to the model to set the stage for its responses. It typically includes guidelines, context, or specific instructions to shape the behavior and output of the model. The system prompt helps in defining the role of the model, the style of responses, and any constraints or rules it should follow.

  2. The next important aspect is the use of the response_format parameter of the create method. We specified that we expect the response format to be JSON. The other valid value is text, which is the default when you do not specify a value.
  3. You have to mention JSON in your system prompt. From the OpenAI docs: if you don’t include an explicit instruction to generate JSON, the model may generate an unending stream of whitespace and the request may run continually until it reaches the token limit. To help ensure you don’t forget, the API will throw an error if the string "JSON" does not appear somewhere in the context.

If you run the above code then you will see the following output.

{'client_name': 'AXA',
 'client_industry': 'Insurance',
 'client_country': 'Belgium',
 'business_problem_summary': "AXA, the largest insurer in Belgium, faced challenges in the property and casualty insurance sector due to strict regulations, security threats, and legacy claims systems. With Accenture's help, they implemented the Guidewire ClaimCenter claims management system in a cloud environment to improve agility, reduce costs, and enhance customer service.",
 'technologies_used': ['Guidewire ClaimCenter', 'Amazon Web Services (AWS)'],
 'cloud_native_solution': True,
 'artificial_intelligence_used': False,
 'project_duration_months': 9}

If you have read the case study link, you will agree that the gpt-3.5-turbo-0125 model did a good job of extracting the relevant information.

I wouldn’t consider this case study a cloud-native solution, but other than that the extraction looks fine.

The response returned by OpenAI also includes the number of input and output tokens. OpenAI pricing is based on these input and output token counts. For gpt-3.5-turbo, output tokens are three times more expensive than input tokens. The same is true for the latest gpt-4o model as well.

response.usage
CompletionUsage(completion_tokens=150, prompt_tokens=1400, total_tokens=1550)

We consumed 1,400 input tokens and 150 output tokens. At a cost of roughly $0.000925 per API call, this translates to roughly 1,080 API calls per dollar. Therefore, processing 1 million requests would cost approximately $925.
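Here is the arithmetic, assuming gpt-3.5-turbo-0125 pricing of $0.50 per 1M input tokens and $1.50 per 1M output tokens (the rates at the time of writing):

# Assumed pricing: $0.50 per 1M input tokens, $1.50 per 1M output tokens
cost_per_call = (1400 * 0.50 + 150 * 1.50) / 1_000_000  # ≈ $0.000925
calls_per_dollar = 1 / cost_per_call                     # ≈ 1,081
cost_per_million_calls = cost_per_call * 1_000_000       # ≈ $925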

To optimize costs, we can reduce input tokens by cleaning the input data while maintaining accuracy.
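For example, a minimal cleaning step (my own sketch, not part of the original code) could strip the whitespace that scraped pages accumulate before the text is sent to the model:

import re

def clean_text(text: str) -> str:
    # Collapse runs of spaces/tabs and drop empty lines;
    # fewer characters in means fewer input tokens billed
    text = re.sub(r"[ \t]+", " ", text)
    return re.sub(r"\n\s*\n", "\n", text).strip()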

Let’s change the model to gpt-4o and compare its response with the gpt-3.5-turbo response. Below is the gpt-4o response.

{'client_name': 'AXA',
 'client_industry': 'Insurance',
 'client_country': 'Belgium',
 'business_problem_summary': 'AXA, the largest insurer in Belgium, faced challenges in the highly competitive property and casualty insurance sector due to strict regulations, security threats, and the burden of legacy claims systems. They needed to migrate to a cloud platform to improve agility, control costs, and enhance customer service.',
 'technologies': ['Guidewire ClaimCenter',
  'Amazon Web Services (AWS)',
  'cloud platforms',
  'application programming interfaces (APIs)'],
 'cloud_native_solution': True,
 'used_artificial_intelligence': False,
 'project_duration_months': 9}

Below is the token usage.

CompletionUsage(completion_tokens=157, prompt_tokens=1385, total_tokens=1542)

The gpt-4o model will cost $9,280 for 1 million executions. It is roughly 10 times more expensive than the gpt-3.5-turbo-0125 model.

If you run the above code multiple times, you will notice that the names of the JSON attributes change. For example, sometimes the model generates artificial_intelligence_used as the field name, and other times it generates used_artificial_intelligence. From the OpenAI API docs:

JSON mode will not guarantee the output matches any specific schema, only that it is valid and parses without errors.

To handle this issue, it is recommended to include an example JSON structure in the prompt as well. We will use the gpt-3.5-turbo-0125 model.

Let’s change the system prompt to include the output format as well. Now we are doing few-shot chain-of-thought prompting.

example_case_study = CaseStudy(
    client_name='Example Client Name',
    industry='Example Industry',
    country='India',
    business_problem='A chatbot solution',
    technologies=['Java', "AWS", "Postgres", "OpenAI", "GPT"],
    is_cloud_native_solution=True,
    uses_ai=True,
    project_duration=10,
)

json_example = example_case_study.json()  # on Pydantic v2, use example_case_study.model_dump_json()

system_prompt = f"""You are an expert at extracting information from any web page text that can be passed other applications.
Take a deep break and think step by step about how to best accomplish this goal using the following steps.

# STEPS
- Read the whole text so that you fully understand it
- Extract the client name for whom we have done this work
- Extract the client industry for whom we have done this work
- Extract the client country mentioned in the input
- Summarize the business problem covering important points discussed in the text
- Extract list of technologies mentioned in the text
- Classify whether this is a cloud native solution or not. Cloud-native is the use of open source software as well as technologies such as containers, microservices and service mesh to develop and deploy scalable applications on cloud computing platforms
- Classify whether we used Artificial Intelligence in this project or not.
- Extract the project duration in months

# OUTPUT INSTRUCTIONS
- Output a JSON that captures all the information mentioned in the STEPS section. 

# OUTPUT FORMAT

Use the following JSON format

{json_example}

# INPUT

{page_text}
"""

In the code snippet shown above:

  • We created an example case study object and converted it to JSON using Pydantic’s json helper method.
  • Next, we modified the system prompt by adding an OUTPUT FORMAT section and specifying our JSON example (the exact string is shown below).
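For reference, the json_example string produced by the json() call looks like this:

{"client_name": "Example Client Name", "industry": "Example Industry", "country": "India", "business_problem": "A chatbot solution", "technologies": ["Java", "AWS", "Postgres", "OpenAI", "GPT"], "is_cloud_native_solution": true, "uses_ai": true, "project_duration": 10}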

If you run the code now, you might see output as shown below.

{'client_name': 'AXA',
 'industry': 'Insurance',
 'country': 'Belgium',
 'business_problem': 'Implementing a cloud-native solution for claims management to improve agility, customer service, and cost transparency.',
 'technologies': ['Guidewire ClaimCenter',
  'Amazon Web Services (AWS)',
  'APIs'],
 'is_cloud_native_solution': True,
 'uses_ai': False,
 'project_duration': 9}

Just like the first time, the extracted information looks fine.

CompletionUsage(completion_tokens=98, prompt_tokens=1471, total_tokens=1569)

This will cost $882 for 1 million calls.

You might be wondering: if OpenAI JSON mode does such a good job, why do we need a library like instructor in our application?

Introducing instructor

Instructor makes it easy to get structured data like JSON from LLMs like GPT-3.5, GPT-4, GPT-4-Vision, and open-source models including Mistral/Mixtral, Anyscale, Ollama, and llama-cpp-python.

It also has ports in multiple programming languages beyond Python.

Let’s look at reasons why you might want to use instructor instead of OpenAI JSON mode.

1. Instructor can help reduce cost

Let’s use the instructor library and compare its token usage on the same example.

Install the library using pip

pip install -U instructor

Next, we will patch the OpenAI client. I will explain how patching works in the next blog post. For now, think of it as an implementation of the decorator pattern.

import instructor

ins_client = instructor.from_openai(client)

case_study, completion_response = ins_client.chat.completions.create_with_completion(
    model="gpt-3.5-turbo",
    response_model=CaseStudy,
    messages=[
        {
            "role": "user",
            "content": page_text
        }
    ]
)

print(case_study.model_dump_json(indent=4))  # pretty-prints the validated CaseStudy instance

While JSON mode enforces the format of the output, it does not help with validation against a specified schema. Directly passing in a schema may not generate expected JSON and may require additional careful formatting and prompting.

Notice that we are not specifying any system prompt. We only specify the response model and the page text.

When you run the above code, you will get the following JSON response.

{
    "client_name": "AXA",
    "industry": "Insurance",
    "country": "Belgium",
    "business_problem": "To modernize the claims management system by migrating to the cloud, implementing Guidewire ClaimCenter, and improving automation and efficiency.",
    "technologies": [
        "Amazon Web Services (AWS)",
        "Guidewire ClaimCenter"
    ],
    "is_cloud_native_solution": true,
    "uses_ai": false,
    "project_duration": 9
}

It is almost identical to the JSON generated by OpenAI JSON mode.

Let’s look at the token usage.

completion_response.usage
CompletionUsage(completion_tokens=75, prompt_tokens=1319, total_tokens=1394)

Earlier, with OpenAI JSON mode, we were consuming the following tokens:

CompletionUsage(completion_tokens=98, prompt_tokens=1471, total_tokens=1569)

As you can see, we are consuming about 23% fewer output tokens (75 vs 98). Input tokens are more or less the same.

If we do the cost calculation with instructor, it will cost $772 for 1 million calls instead of $882 with OpenAI JSON mode. This is roughly a 12% cost saving. Depending on your output JSON structure, the saving can be larger.
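The arithmetic behind these numbers, under the same pricing assumption as before:

# Same assumed pricing: $0.50 / 1M input tokens, $1.50 / 1M output tokens
instructor_cost = (1319 * 0.50 + 75 * 1.50) / 1_000_000  # ≈ $0.000772 → ~$772 per 1M calls
json_mode_cost = (1471 * 0.50 + 98 * 1.50) / 1_000_000   # ≈ $0.000883 → ~$882 per 1M calls
saving = 1 - instructor_cost / json_mode_cost             # ≈ 0.125, i.e. roughly 12%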

To understand how instructor works, it is helpful to look at the instructor debug logs. To do that, we will enable logging.

import logging
import instructor

logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')

logging.getLogger('instructor').setLevel(logging.DEBUG)

Now, if you execute the code you will see instructor debug logs.

Below is the request body sent to OpenAI

openai._base_client - DEBUG - Request options: {'method': 'post', 'url': '/chat/completions', 'files': None, 'json_data': {'messages': [{'role': 'user', 'content': '<<PAGE_TEXT>>'}], 'model': 'gpt-3.5-turbo', 'tool_choice': {'type': 'function', 'function': {'name': 'CaseStudy'}}, 'tools': [{'type': 'function', 'function': {'name': 'CaseStudy', 'description': 'Correctly extracted `CaseStudy` with all the required parameters with correct types', 'parameters': {'properties': {'client_name': {'title': 'Client Name', 'type': 'string'}, 'industry': {'title': 'Industry', 'type': 'string'}, 'country': {'title': 'Country', 'type': 'string'}, 'business_problem': {'title': 'Business Problem', 'type': 'string'}, 'technologies': {'items': {'type': 'string'}, 'title': 'Technologies', 'type': 'array'}, 'is_cloud_native_solution': {'title': 'Is Cloud Native Solution', 'type': 'boolean'}, 'uses_ai': {'title': 'Uses Ai', 'type': 'boolean'}, 'project_duration': {'title': 'Project Duration', 'type': 'integer'}}, 'required': ['business_problem', 'client_name', 'country', 'industry', 'is_cloud_native_solution', 'project_duration', 'technologies', 'uses_ai'], 'type': 'object'}}}]}}

The response received from OpenAI is shown below.

Instructor Raw Response: ChatCompletion(id='chatcmpl-9dIS4HE0k6roPMcCFkzGBwJUnwHYJ', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content=None, role='assistant', function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='call_KlxO4pBALVCAc5jkeOpQHj3n', function=Function(arguments='{"client_name":"AXA","industry":"Insurance","country":"Belgium","business_problem":"AXA needed to modernize its claims management system to stay competitive in the property and casualty insurance sector. The burden of legacy systems, strict regulations, and security threats were hindering their progress in adopting digital technologies, including cloud platforms. With the help of Accenture, AXA implemented Guidewire ClaimCenter and migrated to Amazon Web Services to enhance agility, reduce costs, and improve customer service.","technologies":["Guidewire ClaimCenter","Amazon Web Services"],"is_cloud_native_solution":true,"uses_ai":false,"project_duration":9}', name='CaseStudy'), type='function')]))], created=1719152748, model='gpt-3.5-turbo-0125', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=127, prompt_tokens=1319, total_tokens=1446))

One thing you will notice in the request body sent to OpenAI is that instructor does not use JSON mode. It uses function calling. By setting 'tool_choice': {'type': 'function', 'function': {'name': 'CaseStudy'}} it forces the model to call the CaseStudy function. To call the CaseStudy function, the model extracts information from the page text and passes it as the arguments of the CaseStudy function. The instructor library then instantiates a CaseStudy object from those arguments. It never generates a standalone JSON document. If you need JSON, you can use the json helper method provided by Pydantic.

Generating JSON consumes more output tokens than function calling.

If you want to use JSON mode with instructor, you can specify the mode as shown below.

ins_client = instructor.from_openai(client, mode=instructor.Mode.JSON)

If you look at the instructor debug logs, you will see the request sent to OpenAI.

openai._base_client - DEBUG - Request options: {'method': 'post', 'url': '/chat/completions', 'files': None, 'json_data': {'messages': [{'role': 'system', 'content': '\n                As a genius expert, your task is to understand the content and provide\n                the parsed objects in json that match the following json_schema:\n\n\n                {\n  "properties": {\n    "client_name": {\n      "title": "Client Name",\n      "type": "string"\n    },\n    "industry": {\n      "title": "Industry",\n      "type": "string"\n    },\n    "country": {\n      "title": "Country",\n      "type": "string"\n    },\n    "business_problem": {\n      "title": "Business Problem",\n      "type": "string"\n    },\n    "technologies": {\n      "items": {\n        "type": "string"\n      },\n      "title": "Technologies",\n      "type": "array"\n    },\n    "is_cloud_native_solution": {\n      "title": "Is Cloud Native Solution",\n      "type": "boolean"\n    },\n    "uses_ai": {\n      "title": "Uses Ai",\n      "type": "boolean"\n    },\n    "project_duration": {\n      "title": "Project Duration",\n      "type": "integer"\n    }\n  },\n  "required": [\n    "client_name",\n    "industry",\n    "country",\n    "business_problem",\n    "technologies",\n    "is_cloud_native_solution",\n    "uses_ai",\n    "project_duration"\n  ],\n  "title": "CaseStudy",\n  "type": "object"\n}\n\n                Make sure to return an instance of the JSON, not the schema itself\n'}, {'role': 'user', 'content': '<<PAGE_TEXT>>'}], 'model': 'gpt-3.5-turbo', 'response_format': {'type': 'json_object'}}}

Instructor used the following system prompt:

f"""
As a genius expert, your task is to understand the content and provide
the parsed objects in json that match the following json_schema:\n

{json.dumps(response_model.model_json_schema(), indent=2)}

Make sure to return an instance of the JSON, not the schema itself
"""

If you want to use YAML instead of JSON, you can use https://github.com/NowanIlfideme/pydantic-yaml, which adds YAML support to Pydantic. This conversion is done in the application tier without costing you any OpenAI tokens.
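A minimal sketch, assuming the pydantic-yaml 1.x API (to_yaml_str):

from pydantic_yaml import to_yaml_str

# Serialize the extracted CaseStudy to YAML locally; no extra LLM tokens involved
yaml_str = to_yaml_str(case_study)
print(yaml_str)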

2. Reduces coding mistakes because of structured prompt engineering

Instructor helps you think about the problem at the right level of abstraction. In our example, we are working with case studies. We defined the CaseStudy type and instructor did the rest. We didn’t have to write an explicit step to convert JSON to the required type. Working with types helps reduce coding mistakes. In the OpenAI JSON mode example, if we forgot to add one of the instructions - Extract the client country mentioned in the input - to the system prompt, the generated output would not have the country field.

{
    "client_name": "AXA",
    "industry": "Insurance",
    "business_problem": "AXA wanted to implement the Guidewire ClaimCenter claims management system and make the most of new technology by migrating to the cloud.",
    "technologies": [
        "Guidewire ClaimCenter",
        "Amazon Web Services (AWS)",
        "APIs"
    ],
    "is_cloud_native_solution": true,
    "uses_ai": false,
    "project_duration": 9
}

With instructor, since we are not writing a natural-language system prompt, this mistake cannot happen. The prompt becomes simpler because you encode the rules in the schema.

instructor also makes it easy to work with enum values. Let’s assume that we need industry to be an Enum instead of a str field. We change our data model as shown below.

# Case Study Data Model

from enum import Enum

from pydantic import BaseModel, Field


class Industry(Enum):
    INFORMATION_TECHNOLOGY = "Information Technology"
    HEALTHCARE = "Healthcare"
    MANUFACTURING = "Manufacturing"
    RETAIL = "Retail"
    INSURANCE = "Insurance"
    UNKNOWN = "Unknown"


class CaseStudy(BaseModel):
    client_name: str
    industry: Industry = Field(default=None,
                               description="Correctly assign one of the predefined industries to the case study. If no industry is specified, then use UNKNOWN.")
    country: str
    business_problem: str
    technologies: list[str]
    is_cloud_native_solution: bool
    uses_ai: bool
    project_duration: int

To prevent data misalignment, we can use Enums for standardized fields. Always include an “Other” option as a fallback so the model can signal uncertainty. In our case we have used UNKNOWN as the fallback value.

As you can see above, we changed industry to use the Industry enum. We also added an explicit instruction on how to handle unknown values. The enum, along with the description, is now also sent to OpenAI.

openai._base_client - DEBUG - Request options: {'method': 'post', 'url': '/chat/completions', 'files': None, 'json_data': {'messages': [{'role': 'user', 'content': '<<PAGE_TEXT>>'}], 'model': 'gpt-3.5-turbo', 'tool_choice': {'type': 'function', 'function': {'name': 'CaseStudy'}}, 'tools': [{'type': 'function', 'function': {'name': 'CaseStudy', 'description': 'Correctly extracted `CaseStudy` with all the required parameters with correct types', 'parameters': {'$defs': {'Industry': {'enum': ['Information Technology', 'Healthcare', 'Manufacturing', 'Retail', 'Insurance', 'Unknown'], 'title': 'Industry', 'type': 'string'}}, 'properties': {'client_name': {'title': 'Client Name', 'type': 'string'}, 'industry': {'allOf': [{'$ref': '#/$defs/Industry'}], 'default': None, 'description': 'Correctly assign one of the predefined industries to the case study. If no industry is specified, then use UNKNOWN.'}}, 'type': 'object'}}}]}}

Similarly, we can add specific instructions to specific fields. For business_problem we added:

class CaseStudy(BaseModel):
    # Rest removed for brevity
    business_problem: str = Field(default=None, 
                                  description="Summarize all the key points mentioned in the text. Make sure to include KPIs, business problems, architecture style in the generated summary")

If you run the code now you will see a much better summary.

{
"business_problem": "AXA, the largest insurer in Belgium, faced challenges with legacy claims systems and strict regulations in the property and casualty insurance sector. They needed to migrate to a cloud platform to improve agility, reduce costs, and enhance customer service. With the help of Accenture, they implemented the Guidewire ClaimCenter claims management system on Amazon Web Services (AWS) to create a next-generation claims capability. The migration was completed in just nine months, improving claims registration and settlement times, and introducing automation and cost transparency."
}

3. You can add validations

You can specify validations to ensure that the extracted content meets specific criteria. Even after using an Enum, the model can return a value that is not among the predefined enum values. In that case you might want to fall back to a default value like UNKNOWN.

To handle that, we can add the following validation code.

from typing import Annotated

from pydantic import BaseModel, BeforeValidator, Field


def check_industry_values_are_valid(v: str) -> Industry:
    try:
        return Industry(v)  # look up by value, e.g. "Insurance"
    except ValueError:
        return Industry.UNKNOWN  # fall back instead of failing validation


class CaseStudy(BaseModel):
    industry: Annotated[Industry, BeforeValidator(check_industry_values_are_valid)] = Field(
        ...,
        description="Correctly assign one of the predefined industries based on the customer industry. Check wikipedia for customer industry.")

Similarly, we can also add validations like the project duration can’t be greater than 120 months.
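A minimal sketch of that constraint using Pydantic’s built-in le bound (the 120-month limit is the example from the text):

from pydantic import BaseModel, Field


class CaseStudy(BaseModel):
    # Rest removed for brevity
    project_duration: int = Field(
        ...,
        le=120,  # reject durations greater than 120 months
        description="Project duration in months",
    )

When a response fails validation, instructor can re-ask the model (see its max_retries parameter).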

4. Customizable and Extensible

instructor is highly customizable. You can do the following (a short sketch illustrating a few of these options follows the list):

  • Optional types: can help reduce hallucinations
  • Customize the system prompt
  • Caching
  • Different modes – function mode, JSON mode, JSON markdown mode
  • The same API to work with different LLM providers
  • You can specify examples using the Field annotation
  • If you follow a different naming convention, you can use title in Field
  • Excluded fields – not all fields need to be extracted by the LLM. Some metadata will be source-specific and can be computed outside the LLM.
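Here is a minimal sketch illustrating three of these options (Optional types, a custom title, and examples); the field choice is mine for illustration:

from typing import Optional

from pydantic import BaseModel, Field


class CaseStudy(BaseModel):
    # Rest removed for brevity

    # Optional lets the model return null instead of inventing a value
    project_duration: Optional[int] = Field(
        None,
        title="Project Duration In Months",  # override the auto-generated title if your naming convention differs
        description="Project duration in months",
        examples=[9, 12],  # example values are included in the schema sent to the model
    )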

You can use instructor to extract other useful metadata from your text. You do that by defining a Property class.

class Property(BaseModel):
    index: str = Field(..., description="Monotonically increasing id")
    key: str = Field(..., description="Must be snake_case")
    value: str

And then changing the CaseStudy class as shown below.

class CaseStudy(BaseModel):
    # Rest remains same
    properties: list[Property] = Field(..., description="Numbered list of properties, should be exactly 5")

Note: We are limiting the number of properties to 5.

Now, when we run the code, we also get new fields that we might not have thought about initially.

{
    "client_name": "AXA",
    "industry": "Insurance",
    "country": "Belgium",
    "business_problem": "AXA needed to improve its claims management system to enhance customer service, agility, and cost transparency. They faced challenges with legacy systems, regulations, and security threats.",
    "technologies": [
        "Guidewire ClaimCenter",
        "Amazon Web Services (AWS)"
    ],
    "is_cloud_native_solution": true,
    "uses_ai": false,
    "project_duration": 9,
    "properties": [
        {
            "index": "1",
            "key": "benefits",
            "value": "Improved customer service, agility, and cost transparency"
        },
        {
            "index": "2",
            "key": "features",
            "value": "Automated workflows and over 100 intuitive APIs in production"
        },
        {
            "index": "3",
            "key": "implementation_time",
            "value": "Successfully completed AWS setup and migration in just nine months"
        },
        {
            "index": "4",
            "key": "claims_handling",
            "value": "40% of claims declared via digital channel, 90% cases confirmed within four hours, and 20% full claims volume handled via straight-through-processing (STP)"
        },
        {
            "index": "5",
            "key": "growth",
            "value":"Created a next-generation claims capability supporting AXA's vision for growth"
        }
    ]
}

Conclusion

I found instructor a useful abstraction that makes it simple to write code that extracts structured information.

