Reward Hacking

One term I have been hearing a lot lately, from folks at both OpenAI and Anthropic, is reward hacking. It represents a fundamental challenge in AI alignment and reliability.

What is Reward Hacking?

Reward hacking, also known as specification gaming, occurs when an AI optimizes an objective function—achieving the literal, formal specification of an objective—without actually achieving an outcome that the programmers intended. This phenomenon is closely related to Goodhart’s Law, which states “When a measure becomes a target, it ceases to be a good measure”.

The technical community distinguishes between several types of reward-related failures:

  • Specification gaming: When the AI achieves the literal objective but not the intended spirit of the task
  • Reward hacking: Finding unintended exploits in the reward function as implemented
  • Reward tampering: Actively changing the reward mechanism itself
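
To make this concrete, here is a toy sketch in Python. Everything in it is hypothetical (the actions, the numbers, the reward functions); it only illustrates how an agent that maximizes a proxy reward can satisfy the literal specification while missing the intent:

# Toy illustration of specification gaming; all values are made up.
# A cleaning agent is scored by a proxy reward ("how little dirt is
# visible") while the intended objective is "how little dirt remains".

actions = {
    # action: (visible_dirt_after, actual_dirt_after)
    "clean the dirt": (0.1, 0.1),         # intended behavior
    "cover dirt with a rug": (0.0, 1.0),  # games the proxy
    "do nothing": (1.0, 1.0),
}

def proxy_reward(visible_dirt, actual_dirt):
    return 1.0 - visible_dirt  # what we measured

def true_reward(visible_dirt, actual_dirt):
    return 1.0 - actual_dirt   # what we meant

# A reward-maximizing agent optimizes the proxy...
best = max(actions, key=lambda a: proxy_reward(*actions[a]))
print(best)                         # "cover dirt with a rug"
print(true_reward(*actions[best]))  # 0.0: literal spec met, intent missed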

Why is Claude Code a CLI tool and not an IDE?

I was listening to a talk by Anthropic folks on Claude Code: https://youtu.be/6eBSHbLKuN0?t=1549.

In the talk, the speaker was asked why they built Claude Code as a CLI tool instead of an IDE. They gave two reasons:

  • Claude Code is built by Anthropic, and at Anthropic people use a broad range of IDEs: some use VS Code, some use Zed, vim, or emacs. It was hard to build something that works for everyone, and the terminal is the common denominator.
  • The second reason is that at Anthropic they see up close how fast models are getting better. There is a good chance that by the end of the year people are not using IDEs anymore. They want to be ready for this future and to avoid over-investing in UIs and other layers on top; the way models are progressing, that work may not be useful for much longer.

I think the second point is the important one here. Anthropic is taking a different viewpoint: OpenAI is acquiring Windsurf for $3 billion, and Microsoft has invested heavily in GitHub Copilot over the last few years.

I personally think UIs are important if you want to win enterprise adoption. The majority of enterprise developers will need GUI-based tools.

First impressions of the Mistral Devstral model

Mistral released a new model, Devstral, yesterday. It is designed to excel at agentic coding tasks, meaning it can use tools. It is released under the Apache 2.0 license. It is fine-tuned from Mistral-Small-3.1 and therefore inherits a long context window of up to 128k tokens. It is a 24B-parameter model that uses the Tekken tokenizer with a 131k vocabulary. As per their release blog:

Devstral achieves a score of 46.8% on SWE-Bench Verified, outperforming prior open-source SoTA models by more than 6% points. When evaluated under the same test scaffold (OpenHands, provided by All Hands AI 🙌), Devstral exceeds far larger models such as Deepseek-V3-0324 (671B) and Qwen3 232B-A22B.

If you have a machine with more than 32GB of memory, you can run this model using Ollama:

ollama run devstral:latest

I tried it on one of the use cases I am working on these days: generating Apache JEXL expressions. We extend JEXL with custom functions, so in our prompt we also provide details of our parser. We also provide valid examples of JEXL expressions for the model to learn from in context. We are currently using gpt-4o-mini, which has worked well for us.
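
Roughly, the setup looks like the sketch below. The prompt text, the custom function, and the example expressions are illustrative stand-ins, not our actual prompt; the client usage is the standard OpenAI Python SDK:

# Sketch of the JEXL-generation setup; prompt and examples are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = """You generate Apache JEXL expressions.
Custom functions are available; for example, a hypothetical
date:parse(value, pattern) helper.
Return only the expression, with no explanation.

Examples of valid JEXL expressions:
  amount > 1000 && status == 'OPEN'
  customer.name =~ '.*Inc.*'
"""

def generate_jexl(requirement, model="gpt-4o-mini"):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": requirement},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

print(generate_jexl("orders over 500 for premium customers"))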

I replaced it with devstral:latest via Ollama's OpenAI-compatible REST API, and the following are my findings (a sketch of the swap and the post-processing step follows the list):

  • We found devstral's latency high compared to gpt-4o-mini: it takes on average a minute to generate code, whereas gpt-4o-mini responds in under 30 seconds.
  • devstral does not follow instructions well. We explicitly instructed it to generate only code, with no explanation, but it still defaults to explaining. We had to add a post-processing step that extracts code blocks with a regex.
  • For some inputs it generated SQL instead of JEXL expressions, even though our prompt includes few-shot examples of valid JEXL.
  • It failed to generate valid JEXL code when an expression required operators like =~, producing incorrect expressions instead.
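
For reference, the swap plus the post-processing step looked roughly like the sketch below. Ollama exposes an OpenAI-compatible API at /v1 on its local port, so only the client construction changes; the regex and the helper name are illustrative:

# Sketch: point the same OpenAI client at devstral served by Ollama,
# then strip the explanation it adds despite instructions.
import re
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama",  # required by the SDK, ignored by Ollama
)

CODE_BLOCK = re.compile(r"```(?:\w+)?\n(.*?)```", re.DOTALL)

def extract_code(text):
    # Post-processing: keep only the fenced code block if one is present.
    match = CODE_BLOCK.search(text)
    return match.group(1).strip() if match else text.strip()

response = client.chat.completions.create(
    model="devstral:latest",
    messages=[{"role": "user", "content": "Generate a JEXL expression ..."}],
    temperature=0,
)
print(extract_code(response.choices[0].message.content))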

In short, Mistral's Devstral failed to generate valid JEXL expressions for our use case. It might do better with more popular programming languages like Python or JavaScript, but for a niche language like JEXL it did not do a good job.