Mistral released a new model yesterday. It is designed to excel at Agentic coding tasks meaning it can use tools. It is Apache 2.0 license. It is finetuned from Mistral-Small-3.1, therefore it has a long context window of up to 128k tokens. It is a 24B parameter model that uses Tekken tokenizer with a 131k vocabulary size. As per their release blog
Devstral achieves a score of 46.8% on SWE-Bench Verified, outperforming prior open-source SoTA models by more than 6% points. When evaluated under the same test scaffold (OpenHands, provided by All Hands AI 🙌), Devstral exceeds far larger models such as Deepseek-V3-0324 (671B) and Qwen3 232B-A22B.
If you have a machine with memory more than 32GB then you can run this model using Ollama
ollama run devstral:latest
I tried it on one of the use cases I am working on these days. The use case is generating Apache JEXL expressions. We extend JEXL with custom functions so in our prompt we also provide details of our parser. We also provide valid examples of JEXL expressions for model to do in-context learning. We are currently using gpt-4o-mini which has worked well for us.
I replaced it with devstral:latest via Ollama OpenAI compatible REST API and following are my findings:
- We found devstral latency high compared to gpt-4o-mini. It takes on average 1 minute to generate code. On the other hand gpt-4o-mini responds in less than 30 seconds.
- devstral does not follow instructions well. We explicitly instructed it to only generate code without any explanation but it still defaults to explanation. We had to add a post processing step to extract code blocks using regex
- For some expressions it generate SQL instead of JEXL expressions. In our prompt we have given a few shot examples of valid JEXL expressions but it still generated SQL.
- It failed to generate valid JEXL code when expression required using functions like
=~ .It generated incorrect JEXL expressions
Mistral’s devstral failed to generate valid JEXL expressions. It might be better for more popular programming languages like Python or Javascript but for small languages like JEXL it failed to do a good job.