Claude Opus 4.8: not the benchmarks, the review

Anthropic has released Claude Opus 4.8, an update to Opus 4.7. The headlines are the usual ones: it beats GPT-5.5 on the Super-Agent benchmark, scores 84 percent on Online-Mind2Web (browser control), and is the first model above 10 percent on the strict all-pass standard of the Legal Agent Benchmark. I read benchmarks, but they decide little for my work. One number in the announcement is different: Opus 4.8 is about four times less likely to let a flaw in code pass without comment. That is the number that matters.

What is new, in plain terms

The price stays the same: 5 dollars per million input tokens and 25 per million output tokens, or 10 and 50 in fast mode. The model is available in claude.ai, Claude Code, and the API, where it runs as claude-opus-4-8. Three things stand out:

  • Dynamic workflows in Claude Code (Enterprise, Team, Max): hundreds of subagents run in parallel on large tasks.
  • Effort control in claude.ai and Cowork: you choose how much effort the model spends on a reply, traded against speed.
  • Messages API: system entries can now arrive mid-task without breaking the prompt cache.

On top of that, multimodal input is handled more efficiently (token cost around 61 percent lower than with 4.7) and, per Anthropic, the model reaches new highs on prosocial behavior and shows noticeably less misalignment.
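
To put the prices above into perspective, here is a minimal sketch of a per-request cost estimate. It only uses the four list prices quoted in this post. The function name, the mode labels, and the token counts in the example are my own illustrative assumptions, and the sketch ignores anything like caching or discounts.

```python
# List prices in US dollars per million tokens, as quoted in this post.
# The mode names "standard" and "fast" are labels chosen for this sketch.
PRICES = {
    "standard": {"input": 5.0, "output": 25.0},
    "fast": {"input": 10.0, "output": 50.0},
}


def estimate_cost(input_tokens: int, output_tokens: int, mode: str = "standard") -> float:
    """Estimate the cost of one request in dollars (hypothetical helper)."""
    price = PRICES[mode]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical request: 200,000 input tokens and 20,000 output tokens.
print(estimate_cost(200_000, 20_000))          # 1.5
print(estimate_cost(200_000, 20_000, "fast"))  # 3.0
```

Under these assumptions, fast mode simply doubles the bill for the same request.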

Why the overlooked flaws are the real news

For me, what makes the difference with AI has always been the review, not the model. That is exactly where 4.8 lands. A model that overlooks flaws less often raises the floor a review stands on. That is worth more than one more point in some ranking.

But four times less often is not zero. “Rarely overlooks” quickly turns into “I stop looking myself”, and that is exactly when the progress flips into its opposite. The model reviews better. A human still has to own it.
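
A quick back-of-the-envelope sketch shows why "not zero" matters. The baseline miss rate and the number of flaws below are made-up values for illustration, not figures from the announcement. Only the factor of four comes from it.

```python
def expected_missed(flaws: int, miss_rate: float) -> float:
    """Expected number of flaws that pass review without comment."""
    return flaws * miss_rate


# Hypothetical baseline: the older model lets 20 percent of flaws through.
OLD_MISS_RATE = 0.20
# "About four times less often", taken from the announcement.
NEW_MISS_RATE = OLD_MISS_RATE / 4

# Hypothetical workload: 400 real flaws across a year of reviews.
FLAWS = 400

print(f"{expected_missed(FLAWS, OLD_MISS_RATE):.0f}")  # 80
print(f"{expected_missed(FLAWS, NEW_MISS_RATE):.0f}")  # 20
```

With these assumed numbers, 20 flaws still get through unremarked. Someone has to catch those, and that someone is the human reviewer.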

Hundreds of subagents, the same old question

The dynamic workflows are, at their core, a loop, only bigger than usual: hundreds of agents on one task. It is impressive, and it sharpens a familiar problem. The more code is produced in parallel that nobody on the team wrote themselves, the faster the gap grows between what sits in the repo and what you have actually understood. More agents do not close that gap; they widen it.

Take the better tool, still read what it builds

Is the update worth it? Yes, I use it. The sharper judgment on agentic tasks and the lower error rate are exactly the improvements that matter day to day, not the ranking against GPT-5.5. Benchmarks age in weeks. What remains is the question of whether your software still holds up in a year. No model answers that. You do.

Sources