OpenAI recently released a new feature in their API called Predicted Outputs, designed to reduce the latency of model responses in scenarios where much of the response is already known in advance. A good use case for this is when making incremental changes to code or text. I am currently working on an LLM use case for a customer, building a JSON document generator for their low-code tool. The user interacts through a chat interface to incrementally build the JSON document. I decided to give Predicted Outputs a try to see how it performs with this use case.
Let’s take an example where we have the following JSON. It defines a workflow that installs a dependency using pip:
{
  "name": "Example Workflow",
  "steps": {
    "step1": {
      "run": "pip install abc"
    }
  }
}
Now, suppose the user wants to add a step 2 that runs a Python file, main.py. The resulting JSON should look like this:
{
  "name": "Example Workflow",
  "steps": {
    "step1": {
      "run": "pip install abc"
    },
    "step2": {
      "run": "python main.py"
    }
  }
}
To get started, make sure you have a recent version of the OpenAI Python library installed (1.56.2 at the time of writing):
pip install openai==1.56.2
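If you want to confirm which version you have, the openai package exposes a __version__ attribute:

import openai

# Should print 1.56.2 or newer
print(openai.__version__)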
Now, let’s use the OpenAI Python API to try out Predicted Outputs.
from openai import OpenAI
import time

client = OpenAI()

# The JSON document we want the model to modify
initial_config = """{
  "name": "Example Workflow",
  "steps": {
    "step1": {
      "run": "pip install abc"
    }
  }
}"""

user_prompt = "Add a new step after step 1 that runs main.py. Respond only with code, and without markdown formatting."

start_ts = time.time()
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": user_prompt
        },
        {
            "role": "user",
            "content": initial_config
        }
    ],
    # Pass the existing document as the prediction so the model can reuse its tokens
    prediction={
        "type": "content",
        "content": initial_config
    }
)
end_ts = time.time()

print(f"Executed in {(end_ts - start_ts) * 1000} ms")
In the code above, we define our initial configuration and then prompt the model to add a new step. We explicitly instruct the model to return only code without any markdown formatting.
We pass our messages and the prediction to the OpenAI API. The prediction is simply the original initial_config.
On my Mac, it took 889 ms to execute the code using Predicted Outputs.
When we run this code and print the assistant’s response, we see that it correctly adds the next step:
print(completion.choices[0].message.content)
{
  "name": "Example Workflow",
  "steps": {
    "step1": {
      "run": "pip install abc"
    },
    "step2": {
      "run": "python main.py"
    }
  }
}
Usage Details
The other important information is contained in the completion.usage object:
CompletionUsage(completion_tokens=56, prompt_tokens=84, total_tokens=140, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=30, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=4), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0))
From this, we see that the model generated 56 completion tokens. Of these, 30 tokens came from the predicted content we provided as input. The remaining 26 tokens were newly generated by the model. The model also rejected 4 tokens from the predicted content.
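If you want to work with these numbers programmatically, the counts are available directly on the usage object (the field names below match the CompletionUsage output shown above):

usage = completion.usage
details = usage.completion_tokens_details

accepted = details.accepted_prediction_tokens   # tokens reused from the prediction (30 here)
rejected = details.rejected_prediction_tokens   # predicted tokens the model discarded (4 here)
newly_generated = usage.completion_tokens - accepted

print(f"accepted={accepted}, rejected={rejected}, newly generated={newly_generated}")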
The initial_config JSON accounts for 34 tokens. You can verify this using the OpenAI Tokenizer.
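If you prefer to check locally, here is a small sketch using the tiktoken library. The gpt-4o family uses the o200k_base encoding, and the exact count depends on the whitespace in the string:

import tiktoken

# o200k_base is the encoding used by the gpt-4o model family
encoding = tiktoken.get_encoding("o200k_base")
print(len(encoding.encode(initial_config)))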
The model was able to reuse tokens corresponding to the following JSON fragment:
{
  "name": "Example Workflow",
  "steps": {
    "step1": {
      "run": "pip install abc"
However, it had to discard the last 4 tokens from the initial_config JSON to add the JSON corresponding to step 2.
Running Without Predicted Outputs
Let’s now run the same code without using predictions and compare the latency and token usage.
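The only change is dropping the prediction argument; the rest of the request stays the same:

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": user_prompt
        },
        {
            "role": "user",
            "content": initial_config
        }
    ]
    # no prediction argument this time
)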
With the prediction removed, the execution time increases to 1835 ms. Predicted Outputs therefore save nearly a full second in this example, which can significantly improve the user experience in interactive applications like coding assistants.
Here’s the completion.usage object without predictions:
CompletionUsage(completion_tokens=51, prompt_tokens=68, total_tokens=119, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0))
The model generated 51 completion tokens, 5 fewer than with Predicted Outputs. In other words, using predictions slightly increases completion token usage: 4 of the extra tokens correspond to the rejected prediction text, and the source of the remaining token is unclear.
Summary
| | Without Predicted Outputs | With Predicted Outputs |
|---|---|---|
| Latency | 1835 ms | 889 ms |
| Completion Tokens | 51 | 56 (51 completion + 4 rejected + 1 unknown) |
| Input Tokens | 68 | 84 (source of the 16 extra tokens unclear) |
It’s worth noting that users are charged for both the generated completion tokens and rejected tokens. This differs from the expectation that charges would apply only to net new tokens.
Finally, be sure to review the limitations of Predicted Outputs to determine if this feature suits your use case.