Monkey patching autoevals to show token usage


I use the autoevals library to write evals for evaluating the output of LLMs. In case you have never written an eval before, let me explain it with a simple example. Let’s assume that you are building a quote generator where you ask an LLM to generate an inspirational Steve Jobs quote for software engineers.

from openai import OpenAI

client = OpenAI()

def quote_generator():
    res = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": "Generate a Steve Jobs quote that inspires software engineers.",
            }
        ],
        temperature=0.8
    )
    return res.choices[0].message.content

quote = quote_generator()

It generates the following quote.

Great software isn’t just about code; it’s about the vision and passion behind it. As software engineers, you have the power to create tools that can change the world. Embrace your creativity, think differently, and remember that the best innovations come from those who dare to dream.

Now, we want to evaluate whether the quote is about software engineers and whether it is inspirational.

This is where autoevals helps. You can install autoevals using pip as shown below.

pip install autoevals

We can write an eval as shown below. We are using the LLM-as-a-judge approach: we ask the gpt-4o model to evaluate the quote generated by gpt-4o-mini. The autoevals library provides an LLMClassifier that returns a score.

from autoevals import LLMClassifier

prompt = """You are tasked with evaluating whether the provided quote is both inspirational and relevant specifically to software engineers. Here is the data: 
[BEGIN DATA]
************
[Quote]: {{input}} 
************

[END DATA] 

Your evaluation should be based on two factors:

- Inspirational Quality: Does the quote motivate or encourage the reader in a meaningful way?
- Relevance to Software Engineers: Does the quote directly relate to the challenges, mindset, or skills that software engineers typically face?

Please assign a score based on the following guidelines. 

(A) The quote is highly inspirational and highly relevant to software engineers. It strongly motivates and directly resonates with common experiences, challenges, or goals in software engineering.
(B) The quote is somewhat inspirational and somewhat relevant to software engineers. It provides a moderate level of encouragement and ties to the software engineering field, though the connection could be stronger.
(C) The quote is slightly inspirational but lacks clear relevance to software engineers. It may offer general motivational value but does not specifically address engineering concepts, skills, or mindset.
(D) The quote is neither inspirational nor relevant to software engineers. It fails to motivate and does not connect to any aspect of software engineering.
"""

evaluator = LLMClassifier(
    name="Quote Eval",
    prompt_template=prompt,
    choice_scores={"A": 1, "B": 0.7, "C": 0.4, "D": 0},
    use_cot=True,
    model="gpt-4o",
    temperature=0
)
score = evaluator(input=quote, output="")

I generated the above eval prompt using an LLM as well. LLMs are good at generating eval prompts.
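For instance, here is a sketch of how one might ask an LLM to draft such a rubric, reusing the client from the first example. The meta-prompt wording below is illustrative, not the exact prompt used for this post.

meta_prompt = """Write an LLM-as-a-judge evaluation prompt that checks whether a quote
is (1) inspirational and (2) relevant to software engineers. Refer to the quote as
{{input}} and define graded choices (A) through (D), from meeting both criteria fully
down to meeting neither."""

res = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": meta_prompt}],
)
eval_prompt = res.choices[0].message.content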

The score in JSON format is shown below. The score is 1 for this quote. As you can see, the generated quote is both inspirational and targets software engineers.

{
    "score": 1,
    "metadata": {
        "choice": "A",
        "rationale": "1. Inspirational Quality: The quote emphasizes the power of creativity and the potential impact of software engineering on the world. It encourages engineers to embrace their creativity and think differently, which is a strong motivational message. The idea of shaping the future and changing the world is highly inspirational.\\n\\n2. Relevance to Software Engineers: The quote directly addresses software engineers by mentioning the act of writing code and crafting experiences. It highlights the unique ability of engineers to shape the future, which is a common theme in the field. The mention of every line of code having the potential to change the world resonates with the challenges and goals of software engineers.\\n\\n3. Conclusion: The quote is both highly inspirational and highly relevant to software engineers. It strongly motivates and directly resonates with common experiences, challenges, and goals in software engineering. Therefore, the appropriate choice is (A)."
    }
}

If we evaluate a different quote, such as this one by Albert Einstein

Two things are infinite: the universe and human stupidity; and I’m not sure about the universe.

Then, we get a score of 0.4 and the choice C.

{
    "score": 0.4,
    "metadata": {
        "choice": "C",
        "rationale": "1. Inspirational Quality: The quote by Albert Einstein is often considered thought-provoking and humorous. It highlights the vastness of human ignorance in a light-hearted way. While it may inspire some to reflect on human nature and the pursuit of knowledge, it doesn't provide a direct motivational push or encouragement.\\n\\n2. Relevance to Software Engineers: The quote does not specifically address any aspect of software engineering. It doesn\'t relate to the skills, challenges, or mindset specific to software engineers. It is more of a general commentary on human nature and the universe.\\n\\n3. Conclusion: Given the lack of direct relevance to software engineering and only a slight inspirational quality, the quote does not strongly motivate or resonate with software engineers. Therefore, it is best categorized as slightly inspirational but lacking clear relevance to software engineers."    
    }
}

Since I am using an LLM as a judge, I wanted to capture the token usage of the underlying LLM. The autoevals library does not expose this information. I opened an issue, and the autoevals maintainers suggested that we can do it by monkey patching the library.

Monkey patching in Python refers to the practice of dynamically modifying or extending a class or module at runtime. It allows developers to change or add functionality to existing classes or functions without modifying the original source code. This is often done to fix bugs, alter behavior, or extend functionality in third-party libraries or frameworks.
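To make the idea concrete, here is a minimal, self-contained sketch with a made-up Greeter class (nothing to do with autoevals) that swaps out a method on a class at runtime:

class Greeter:
    def greet(self, name):
        return f"Hello, {name}"

def shouting_greet(self, name):
    # New behavior layered on top of the original method
    return Greeter._original_greet(self, name).upper() + "!"

Greeter._original_greet = Greeter.greet   # keep a reference to the original
Greeter.greet = shouting_greet            # patch the class at runtime

print(Greeter().greet("Ada"))             # HELLO, ADA!

We will do the same thing with autoevals below, except that we will patch a single instance rather than the whole class.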

I had to monkey patch the _postprocess_response method of OpenAILLMClassifier. The LLMClassifier class extends OpenAILLMClassifier.

In my code, I created a new method with the same signature.

def _new_postprocess_response(self, resp):
    if len(resp["choices"]) > 0:
        # Let autoevals compute the score as usual ...
        score = self._process_response(resp["choices"][0]["message"])
        # ... then attach the raw token usage from the OpenAI response
        score.metadata['usage'] = resp['usage']
        return score
    else:
        raise ValueError("Empty response from OpenAI")

Then, after creating the LLMClassifier instance, I bound the new method to it.

evaluator = LLMClassifier(
    name="Quote Eval",
    prompt_template=prompt,
    choice_scores={"A": 1, "B": 0.7, "C": 0.4, "D": 0},
    use_cot=True,
    model="gpt-4o",
    temperature=0
)
evaluator._postprocess_response = _new_postprocess_response.__get__(evaluator)
score = evaluator(input=quote, output="")
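The __get__ call uses Python's descriptor protocol to bind the plain function to this particular evaluator instance, so self is filled in automatically when autoevals invokes _postprocess_response. An equivalent, arguably more readable way to do the same binding is types.MethodType from the standard library:

import types

# Bind the replacement function to this evaluator instance only
evaluator._postprocess_response = types.MethodType(_new_postprocess_response, evaluator)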

Now, when I print the score, it also contains the LLM token usage.

{
    "score": 1,
    "metadata": {
        "choice": "A",
        "rationale": "...",
        "usage": {
            "completion_tokens": 171,
            "prompt_tokens": 455,
            "total_tokens": 626,
            "prompt_tokens_details": {
                "cached_tokens": 0
            },
            "completion_tokens_details": {
                "reasoning_tokens": 0
            }
        }
    }
}
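Since the patched method stores the usage as a plain dictionary on score.metadata, it is straightforward to read or aggregate the counts across an eval run, for example:

usage = score.metadata["usage"]
print(usage["prompt_tokens"], usage["completion_tokens"], usage["total_tokens"])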
