Written by AI agents, curated and verified by me.

Mistral Leanstral 1.5: the model built for checking is the more interesting release

  • Mistral
  • Verification
  • Agentic Engineering
  • Coding Agents

On 2 July, Mistral released Leanstral 1.5: an open model under Apache 2.0, 119 billion parameters total with 6 billion active, specialized in formal proofs in Lean 4. The benchmark numbers are remarkable, but they are not why I am writing about this release. The reason is the agentic verification mode: the model edits files, runs bash commands, reads the output of the Lean language server, and in an experiment across 57 repositories found five bugs nobody had reported on GitHub before. New code generators now appear almost weekly. A model built for checking is the rarer piece of news.

What is Leanstral 1.5?

Leanstral is Mistral’s model line for proof engineering in Lean 4, a proof assistant whose kernel machine-checks every proof. Version 1.5 is a performance upgrade, trained in three stages: mid-training, supervised fine-tuning, and reinforcement learning with CISPO. The numbers: it scores 100 percent on miniF2F, on both the validation and the test set. It solves 587 of 672 problems on PutnamBench, at roughly 4 dollars per problem according to Mistral, against an estimated 300 dollars or more for the comparison candidate Seed-Prover. It reaches 87 percent on FATE-H and 34 percent on FATE-X, two abstract algebra benchmarks at graduate and PhD level respectively. On FLTEval, built from real pull requests in the Fermat’s Last Theorem repository, pass@1 improves from 21.9 to 28.9 and pass@8 from 31.9 to 43.2. The scaling with compute budget stands out: 44 solved Putnam problems at 50,000 tokens, 587 at 4 million. One proof about AVL trees ran for over 2.7 million tokens across 22 context compactions, according to Mistral.

How does the agentic verification mode work?

Mistral trains the model in two environments. The first is a multiturn loop: the model receives a theorem, submits a proof, gets the Lean compiler’s feedback, and refines its attempt until the proof compiles or the budget runs out. The second is a code agent environment, and that is the interesting part. There, in Mistral’s words, the model works “like a developer in a raw filesystem”: it edits files, runs bash commands, and queries the Lean language server in real time for goals, errors, and type information. That enables long-horizon tasks such as completing partial proofs in a repository and building auxiliary lemmas. At the end, a fork of SafeVerify checks the result against a list of target theorems. It is the same toolset that coding agents use, except that the success signal here is not a passing test but a machine-checked proof.
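The multiturn loop can be sketched in a few lines of Rust. This is a minimal illustration under stated assumptions, not Mistral's training code: the names `refine_until_checked`, `Outcome`, and `CheckResult` are hypothetical, `propose` stands in for the model, and `check` stands in for the Lean compiler.

```rust
/// Hypothetical result of one compiler check: Ok if the proof compiles,
/// Err with the compiler's error text otherwise.
type CheckResult = Result<(), String>;

/// Hypothetical outcome of the whole loop.
#[derive(Debug, PartialEq)]
enum Outcome {
    /// The checker accepted a proof on the given attempt.
    Proved { proof: String, attempts: usize },
    /// The attempt budget ran out; the last feedback is kept for inspection.
    BudgetExhausted { last_error: String },
}

/// Propose, check, refine: repeat until the checker accepts or the budget is spent.
/// `propose` receives the theorem and the previous feedback (None on the first turn).
/// `check` plays the role of the Lean compiler (an assumption of this sketch).
fn refine_until_checked<P, C>(
    theorem: &str,
    max_attempts: usize,
    mut propose: P,
    mut check: C,
) -> Outcome
where
    P: FnMut(&str, Option<&str>) -> String,
    C: FnMut(&str) -> CheckResult,
{
    let mut feedback: Option<String> = None;
    for attempt in 1..=max_attempts {
        let proof = propose(theorem, feedback.as_deref());
        match check(&proof) {
            Ok(()) => return Outcome::Proved { proof, attempts: attempt },
            // Keep the error text so the next proposal can react to it.
            Err(message) => feedback = Some(message),
        }
    }
    Outcome::BudgetExhausted {
        last_error: feedback.unwrap_or_default(),
    }
}
```

The point of the design is that only `check` decides success. The proposer can be arbitrarily unreliable, because nothing it produces counts until the checker accepts it.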

How does a proof model find bugs in real code?

Through a pipeline Mistral tried on 57 repositories. The Aeneas tool translates Rust code to Lean. Leanstral reads the code, infers the presumed intent, and formulates correctness properties. It then tries to prove each property in up to four attempts. If all four fail, it tries to prove the negation in four more attempts. If that succeeds, the violation of the property is machine-checked. The result: 47 properties flagged as violated, 11 of them genuine bugs, 5 of those previously unreported on GitHub. The example Mistral shows: in the datrs/varinteger library, the sign function for zigzag decoding overflows on the input U64.MAX because it computes value + 1. That means a crash in debug mode and silent data corruption in release mode. The pipeline caught it automatically.
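The prove-then-negate step described above can be sketched as a small decision function. This is an illustration only: `classify_property`, `Verdict`, and the `try_prove` callback are hypothetical names, and the callback is assumed to return true only when the Lean kernel has accepted a proof of the given statement.

```rust
/// Hypothetical verdict for one inferred correctness property.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Verdict {
    /// Some attempt proved the property.
    Holds,
    /// Some attempt proved the negation: a machine-checked violation.
    Violated,
    /// Neither direction succeeded within the attempt budget.
    Inconclusive,
}

/// Four attempts per direction, as in the pipeline described above.
const ATTEMPTS_PER_DIRECTION: usize = 4;

/// `try_prove(statement)` is a stand-in for one prover attempt whose
/// result has already been checked by the kernel (assumption of this sketch).
fn classify_property<F>(property: &str, mut try_prove: F) -> Verdict
where
    F: FnMut(&str) -> bool,
{
    // First direction: try to prove the property itself.
    if (0..ATTEMPTS_PER_DIRECTION).any(|_| try_prove(property)) {
        return Verdict::Holds;
    }
    // Second direction: only after every attempt failed, try the negation.
    let negation = format!("not ({})", property);
    if (0..ATTEMPTS_PER_DIRECTION).any(|_| try_prove(&negation)) {
        return Verdict::Violated;
    }
    Verdict::Inconclusive
}
```

Note what `Violated` does and does not establish: the negation of the property is proven, but whether the property matched the code's real intent is still a human judgment. That gap is where the 47 flagged properties shrink to 11 genuine bugs.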
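To make the overflow concrete, here is a minimal reconstruction of the pattern, assuming a decoder that maps odd inputs to a negative number via value + 1. It is not the code from datrs/varinteger: the function names are hypothetical, and `checked_add` and `wrapping_add` are used to make the debug-mode and release-mode behaviors explicit and deterministic.

```rust
/// Naive zigzag decode built on `value + 1` (hypothetical reconstruction).
/// `checked_add` surfaces the overflow: None corresponds to the point where
/// a plain `value + 1` would panic in a debug build.
fn unsign_checked(value: u64) -> Option<i64> {
    if value & 1 == 1 {
        value.checked_add(1).map(|v| -((v / 2) as i64))
    } else {
        Some((value / 2) as i64)
    }
}

/// The same computation with release-mode semantics: the addition wraps
/// around silently instead of panicking.
fn unsign_wrapping(value: u64) -> i64 {
    if value & 1 == 1 {
        -((value.wrapping_add(1) / 2) as i64)
    } else {
        (value / 2) as i64
    }
}

/// Overflow-free zigzag decode using a shift and an XOR.
/// For odd inputs the mask is -1 (all bits set), which flips every bit.
fn unsign_safe(value: u64) -> i64 {
    ((value >> 1) as i64) ^ -((value & 1) as i64)
}
```

On `u64::MAX`, `unsign_checked` returns `None`, `unsign_wrapping` returns 0, and `unsign_safe` returns `i64::MIN`, which is the value that zigzag-encodes to `u64::MAX`. On every other input the three agree, which is why a bug like this can survive ordinary testing.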

What does this mean for you?

I consider this release more interesting than most new code generators, and the reason sits in the sequence 47, 11, 5. In an ordinary LLM review, every finding is a claim a human has to verify in full. Here the proof itself is machine-checked: a proof that is wrong does not compile, and the Lean kernel cannot be talked into anything. What remains is the weaker link in the chain, and the sequence names it honestly. Of 47 flagged properties, only 11 were genuine bugs, because the model guesses the code’s intent before formalizing it. So you no longer check the proof; you check the specification. That is much less work, but it does not disappear, and the responsibility for it stays with you. It is exactly the shift I describe in agentic engineering: generation has become cheap, verification is the bottleneck. A freely licensed model that works on that end is therefore the right kind of release. If you want to try it: the weights are on Hugging Face, the API endpoint leanstral-1-5 is free according to Mistral, and Mistral recommends its own Vibe as the interface, via vibe --agent lean. The scope is narrow: mathematics first, Rust code via the Aeneas detour. But the direction is right.

Sources