Case study: Building an LLM-based workflow configuration generator for a low-code product


I run a small indie consulting firm that specializes in building LLM-based solutions. Recently, I worked on a project where we had to generate JSON-based workflow configurations for a low-code product. Think of it like AWS Step Functions, where you write your business workflow in a JSON configuration file. In this post, I will share the lessons I learned while building this solution.

Below is a simplified example of a workflow configuration. In the customer's real schema, a step can have close to 50 fields.

{
    "name" : "Name of the workflow",
    "description" : "Description of the workflow",
    "steps" : {
        "step_1" : {
            "async" : false,
            "sql_query_name" : "Name of the query to execute",
            "transform_request" : "Can be a Go template",
            "transform_response" : "Can be a Go template",
            "steps" : {
                "step_1_1" : {

                },
                "step_1_2" : {

                }
            }
        },
        "step_2" : {
            "async" : true,
            "function_to_execute" : "Name of the query to execute",
            "transform_request" : "Can be a Go template",
            "transform_response" : "Can be a Go template",
            "steps" : {
                "step_2_1" : {

                },
                "step_2_2" : {

                }
            }
        }
    }
}

Important points to note in the above JSON configuration:

  • The workflow configuration is recursive. A step can have steps, and those steps can have further steps, and so on.
  • Step names follow a pattern ^[a-z]+(_[a-z]+)*$.
  • Certain JSON attributes require us to generate valid Go templates. These Go templates use some reusable library functions.

My customer wanted to automate the generation of workflow configurations for the following reasons:

  • Productivity and Efficiency: It takes their current implementation engineers 3-5 days to create a valid workflow configuration.
  • Reliability: Writing these configuration files manually is an error-prone task, and different implementation engineers write workflows in different ways. They follow different naming conventions and use different functions.
  • Approachability: They wanted to make it easy for their end customers to generate these workflow files. This way, non-technical users or those unfamiliar with their JSON schema can also write these workflow files.
  • AI capabilities: They wanted to bring AI/LLM capabilities into their product.

We implemented this solution using the following tech stack:

  • Python
  • FastAPI
  • OpenAI gpt-4o-mini model
  • Pydantic with Instructor Python Library

Now that I have given you an understanding of the problem, let me cover important things you should keep in mind when building such solutions.

1. Start with the JSON schema

At first glance, this problem looks like a good fit for the structured output feature that most LLM providers support. From the OpenAI structured output docs:

Structured Outputs is a feature that ensures the model will always generate responses that adhere to your supplied JSON Schema, so you don’t need to worry about the model omitting a required key, or hallucinating an invalid enum value.

To use structured outputs, you need a JSON schema. In my case, the customer gave me a 10-page document describing the different attributes of their schema, and I had to work with them to create the JSON schema from valid JSON configurations. Valid workflow configuration files also help you understand the typical token length of the output. The maximum output length of gpt-4o and gpt-4o-mini is 16,384 tokens; in our case, configuration files were in the 4,000-6,000 token range. Keep in mind that models struggle to generate long outputs, so thorough testing is essential.

The time you spend understanding the schema will help you understand the complexity of the problem. Do not take any shortcuts here. If you don’t know what you are generating, you will not be able to do a good job.

Once you have some understanding of the JSON schema, check whether your schema is fully supported by OpenAI. The OpenAI Structured Outputs feature only supports a subset of the JSON Schema language. You should read the documentation to understand the limitations: https://platform.openai.com/docs/guides/structured-outputs#supported-schemas. For my JSON schema, the following were not supported:

  • For objects, patternProperties is not supported.
  • All fields must be required. You can work around this with a union type that allows null, but this requires you to update your JSON schema and post-process the generated JSON to remove null fields (see the sketch after this list).
  • A schema may have up to 100 object properties in total, with up to 5 levels of nesting. We had 80 properties in total.
  • For strings, minLength, maxLength, pattern, and format are not supported.
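
For example, the "all fields must be required" limitation can be handled on the Pydantic side by declaring each optional field as a union with None and stripping the nulls after generation. Below is a minimal, illustrative sketch; the StepDraft model and its fields are placeholders, not the customer's schema.

from typing import Optional

from pydantic import BaseModel, Field


class StepDraft(BaseModel):
    # Every field appears in the generated JSON schema, but the union with
    # None lets the model emit null when a value does not apply.
    sql_query_name: Optional[str] = Field(default=None)
    function_to_execute: Optional[str] = Field(default=None)


step = StepDraft(sql_query_name="get_orders")
# Drop the null fields before handing the JSON to the workflow engine.
print(step.model_dump_json(exclude_none=True))  # {"sql_query_name":"get_orders"}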

2. Understand how users will use it

For this use case, there are two possible UX approaches:

  • A chat interface that helps the user incrementally build the workflow configuration
  • A web IDE copilot: a GitHub Copilot-like experience built into a web editor such as CodeMirror

We went with the chat interface, which they integrated into their product.

Now that we had settled on a chat interface, it was important to understand how users would prompt the system to generate configurations. I asked the customer what prompts they would write and what responses they expected the system to return. This is where the actual complexity of an application starts to emerge. You have to understand the following:

  • How much information should we expect users to provide? Will they provide long detailed prompts or short prompts?
  • If a user is not providing enough information in the prompt, should we ask follow-up questions for more information, use sensible defaults, or fetch relevant context from a retrieval system?
  • How long will the conversations be? Do we need to do any intelligent context management?
  • How will we know the generated JSON is valid? Structured output ensures that the generated JSON meets the JSON schema, but in our case we also had to make sure the generated Go templates were valid. Our customer exposed an API that we called to check whether the JSON was valid; if it was not, we regenerated the JSON using the feedback from the validation endpoint (a sketch of this loop follows the list).
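
Here is a rough sketch of that validation loop. The endpoint URL, the response shape, and the generate_config helper are all assumptions for illustration, not the customer's actual API.

import httpx

VALIDATION_URL = "https://customer.example.com/api/workflows/validate"  # hypothetical endpoint
MAX_ATTEMPTS = 3


def generate_valid_config(messages: list[dict]) -> str:
    for _ in range(MAX_ATTEMPTS):
        config_json = generate_config(messages)  # placeholder for the LLM generation call
        result = httpx.post(VALIDATION_URL, content=config_json,
                            headers={"Content-Type": "application/json"}).json()
        if result.get("valid"):
            return config_json
        # Feed the validation errors back so the next attempt can fix them.
        messages.append({"role": "user",
                         "content": f"The generated configuration is invalid: {result.get('errors')}. "
                                    "Fix these issues and regenerate the workflow."})
    raise ValueError("Could not generate a valid workflow configuration")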

3. Have two domain models

To overcome structured output limitations, we decided to use two domain models – internal and external. We used Pydantic to model our domain classes.

Our external models map to the customer's JSON schema. Below is a simplified example.

from typing import Optional

from pydantic import BaseModel, Field


class StepConfiguration(BaseModel):
    # `async` is a reserved keyword in Python, so the field is aliased.
    a_async: Optional[bool] = Field(default=None, alias="async", serialization_alias="async")
    sql_query_name: Optional[str] = Field(default=None)
    function_to_execute: Optional[str] = Field(default=None)
    transform_request: Optional[str] = Field(default=None)
    transform_response: Optional[str] = Field(default=None)


class WorkflowConfiguration(BaseModel):
    name: str
    description: str
    steps: dict[str, StepConfiguration]

Next, we have the internal model.

from textwrap import dedent
from typing import List, Optional

from pydantic import BaseModel, Field


class InternalStepConfiguration(BaseModel):
    name: str = Field(description="The name of the step", pattern="^[a-z]+(_[a-z0-9]+)*$")
    # `async` is a reserved keyword in Python, so the field is aliased.
    a_async: bool = Field(alias="async", serialization_alias="async",
                          description="True if execution of this step has to be done asynchronously else false",
                          default=False)
    transform_request: Optional[str] = Field(default=None)
    transform_response: Optional[str] = Field(description=dedent("""This field defines how to transform the response.
    It is an escaped stringified JSON that supports Go template syntax. Keep the following in mind when transforming the response:
    - You can access the original body using the `.Vars.OrgBody` object.
    - The generated string needs to be escaped
    - You can use custom functions defined in the system prompt to transform the response
    - You can use Go template functions from the sprig library"""), default=None)


class InternalWorkflowConfiguration(BaseModel):
    name: str = Field(description="The name of the workflow.")
    description: str = Field(description="The description of the workflow")
    steps: List[InternalStepConfiguration] = Field(
        description=dedent("""Zero or more steps that make up the workflow.
        You have to use the information in the user prompt to create these steps."""),
        default=[])

The LLM returns an instance of InternalWorkflowConfiguration, which we then convert to WorkflowConfiguration in our code.

# `config` is the InternalWorkflowConfiguration instance returned by the LLM;
# `convert` maps an internal step to the external StepConfiguration.
workflow_config = WorkflowConfiguration(
    name=config.name,
    description=config.description,
    steps={s.name: convert(s) for s in config.steps})
json_config = workflow_config.model_dump_json(exclude_none=True, by_alias=True,
                                              exclude={'steps': {'__all__': {'name'}}})
validation_result = validate(config=workflow_config)
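
The convert helper is not shown above; here is a minimal sketch of what it might look like for the simplified models in this post. It is purely illustrative, since the real mapping covers many more fields.

def convert(step: InternalStepConfiguration) -> StepConfiguration:
    # The step name becomes the dictionary key in WorkflowConfiguration.steps,
    # so it is not copied onto the external step itself. The external model is
    # populated via the `async` alias.
    return StepConfiguration.model_validate({
        "async": step.a_async,
        "transform_request": step.transform_request,
        "transform_response": step.transform_response,
    })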

4. Write test cases

It is important in LLM applications to write tests that help you determine whether you are generating correct answers. Also, as you make changes, you want to ensure you are not introducing any regressions.

import json
from textwrap import dedent

from autoevals import ValidJSON

# Message, generate and read_file_as_str come from the application code under test.

def test_workflow_generation():
    user_prompt = dedent("""User prompt""")
    messages = [Message(content=user_prompt, role="user")]

    output_json, _ = generate(messages=messages)

    evaluator = ValidJSON()
    json_schema = json.loads(read_file_as_str(filename='../json_schema.json'))
    assert evaluator(output_json, json_schema).score == 1

    step = json.loads(output_json)['steps']['step_1']
    delay = step['delay']
    assert delay == 5
    assert step['loop_in_parallel']
    assert step['loop_variable'] == '["foo","bar","tar"]'

We validate that the generated JSON conforms to the JSON schema and also check individual step values.

We also wrote extensive tests for our Chat API to cover multi-turn conversations; a simplified example is shown below.
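
The sketch below reuses the Message and generate helpers from the test above and assumes the second value returned by generate is the assistant's textual reply; the prompts and assertions are placeholders rather than the real suite.

def test_multi_turn_workflow_update():
    messages = [Message(content="Create a workflow with one step that runs the get_orders query",
                        role="user")]
    output_json, assistant_reply = generate(messages=messages)

    # Continue the conversation and ask for a change to the generated workflow.
    messages.append(Message(content=assistant_reply, role="assistant"))
    messages.append(Message(content="Make step_1 asynchronous", role="user"))
    output_json, _ = generate(messages=messages)

    step = json.loads(output_json)['steps']['step_1']
    assert step['async'] is True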

5. Validation and retries are your friends

We had business rules like "exactly one of four fields can be present" or "foo is required if bar is present". We codified them as Pydantic validators. If a Pydantic validation failed, we retried the request, passing the validation feedback back to the model.

# Defined inside the Pydantic step model; requires `from pydantic import model_validator`.
@model_validator(mode="after")
def check_exactly_one_of(self):
    non_none_fields = [field for field in (self.f1, self.f2, self.f3, self.f4) if
                       field is not None]
    if len(non_none_fields) != 1:
        raise ValueError("Exactly one of 'f1', 'f2', 'f3', or 'f4' must be set.")
    return self

We were able to increase our accuracy from 70% to 90% using this simple technique. We limited the number of retries to 3.
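
The Instructor library can handle this retry loop for you: when the response fails Pydantic validation, it re-asks the model with the validation error attached, up to max_retries times. Below is a minimal sketch of how that might be wired up; the prompts are placeholders.

import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

system_prompt = "You generate workflow configurations..."  # placeholder
user_prompt = "Create a workflow that ..."                 # placeholder

config = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=InternalWorkflowConfiguration,
    max_retries=3,  # re-prompt with the Pydantic validation error up to 3 times
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ],
)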

6. Create human-readable error messages using an LLM

We rely on Pydantic to ensure our objects are valid. There are times when OpenAI fails to generate a valid configuration, and when this happens, Pydantic throws a validation error. Since Pydantic error messages can be difficult to read, we generate a human-readable error message by prompting OpenAI with the template shown below.

Generate human readable error message from the below error message. Do not include any URLs or python library details.
---
{message} 
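
Wired into code, this is roughly what it looks like; the humanize_error helper and its placement are illustrative.

from openai import OpenAI
from pydantic import ValidationError

client = OpenAI()

ERROR_PROMPT = """Generate human readable error message from the below error message. \
Do not include any URLs or python library details.
---
{message}"""


def humanize_error(error: ValidationError) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": ERROR_PROMPT.format(message=str(error))}],
    )
    return response.choices[0].message.content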

Conclusion

Building an LLM-based workflow configuration generator presented unique challenges, from handling complex JSON schemas to ensuring reliable output generation. The key to success lay in thorough understanding of the problem domain, implementing robust validation mechanisms, and creating a user-friendly experience. Through careful architecture decisions and comprehensive testing, we achieved a 90% accuracy rate in generating valid configurations.

If you’re looking to build production-ready LLM applications or need help with similar challenges, our consulting firm specializes in developing robust LLM solutions. Feel free to reach out to me for a consultation about your specific use case.

