Setting a realistic accuracy target for LLM tasks

Today I was reading OpenAI guide on model selection https://platform.openai.com/docs/guides/model-selection where they explained how to calculate a reaslistic accuracy target for LLM task by evaluating financial impact of model decisions. They gave an example of fake news classifier.

In a fake news classification scenario:

Correctly classified news: If the model classifies it correctly, it saves you the cost of a human reviewing it – let’s assume $50.

Incorrectly classified news: If it falsely classifies a safe article or misses a fake news article, it may trigger a review process and possible complaint, which might cost us $300.

Our news classification example would need 85.8% accuracy to cover costs, so targeting 90% or more ensures an overall return on investment. Use these calculations to set an effective accuracy target based on your specific cost structures.

This is a good way to find the accuracy you need for the task. Break-even accuracy is calculated using the below formula

Break-even Accuracy = Cost of Wrong ÷ (Cost of Wrong + Cost Savings)

Below break-even you lose money. At break-even you break even. Above break-even you make profit. The target accuracy ensures your desired ROI. I have built a simple calculator that you can use https://tools.o14.ai/llm-task-accuracy-calculator.html.

This calculation works well when you are making binary decisions with LLMs. These are usually classification tasks. Below are some examples.

Content Moderation

Correct: Properly moderate content → Save $20 in manual review
Incorrect: Miss harmful content or over-moderate → $500 in legal/PR costs

Resume Screening

Correct: Identify good candidate → Save $100 in recruiter time
Incorrect: Miss good candidate or pass bad one → $2,000 in hiring costs

Code Review Automation

Correct: Catch bug before production → Save $200 in developer time
Incorrect: Miss critical bug → $10,000 in downtime costs

You can also adapt it to other domains like customer service chatbots. Instead of correct/incorrect, use quality tiers:

High quality response: Save $25 (avoid human agent)
Medium quality: $10 cost (requires follow-up)
Poor quality: $75 cost (frustrated customer + human intervention)

Any LLM task with measurable business impact can use cost-benefit analysis – you just need to define what “quality” means for your specific use case and map it to financial outcomes.

Discover more from Shekhar Gulati

Subscribe to get the latest posts sent to your email.

Discover more from Shekhar Gulati

Share this:

Related

Leave a comment Cancel reply