
GPT-5.5: more autonomy does not mean less checking

  • OpenAI
  • Coding Agents
  • Verification

On 23 April, OpenAI released GPT-5.5, by its own account its most capable and most intuitive model yet. The interesting part is less the headline than the claim behind it: you are meant to hand the model a messy, multi-part task and trust it to plan, use tools, check its own work, and keep going until the task is done. That is exactly where a sober look pays off. Capability rises, responsibility stays.

What does GPT-5.5 do?

GPT-5.5 is stronger above all at agentic coding, computer use, and knowledge work. The benchmarks back this up: 82.7 percent on Terminal-Bench 2.0, 58.6 percent on SWE-Bench Pro, 84.9 percent on GDPval, 78.7 percent on OSWorld-Verified. What stands out is not only the level but the efficiency. According to OpenAI, the model reaches these figures at the same per-token latency as GPT-5.4 and needs noticeably fewer tokens for the same Codex tasks. In ChatGPT and Codex it is available now to Plus, Pro, Business, and Enterprise users, alongside GPT-5.5 Pro for the harder cases and GPT-5.5 Thinking in ChatGPT. It is coming to the API soon, announced at 5 dollars per million input tokens and 30 dollars per million output tokens, with a context window of one million tokens. In Codex the context window is 400,000 tokens.
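To put the announced API prices in concrete terms, here is a minimal sketch of the arithmetic. The two price constants come from the figures above. The function name and the token counts in the example are hypothetical, and the calculation ignores any discounts, caching, or pricing tiers that may apply.

```python
# Announced API prices from the post, in US dollars per one million tokens.
INPUT_USD_PER_MILLION = 5.0
OUTPUT_USD_PER_MILLION = 30.0


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one run at the announced per-token prices.

    Hypothetical helper for illustration; not part of any OpenAI SDK.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (
        input_tokens * INPUT_USD_PER_MILLION
        + output_tokens * OUTPUT_USD_PER_MILLION
    )
    return total / 1_000_000


# Hypothetical run: 200,000 input tokens and 20,000 output tokens.
print(f"{estimate_cost_usd(200_000, 20_000):.2f}")  # prints 1.60
```

Output tokens cost six times as much as input tokens, so a model that needs fewer tokens for the same task shows up mainly on the output side of this sum.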

Why the longer autonomy is the real story

The real jump is in stamina. According to OpenAI, GPT-5.5 holds up across longer, multi-step tasks, plans, corrects itself, and stops early less often. That is useful. But it also pushes back the point at which a mistake shows up. A model working on its own for an hour makes dozens of small decisions that no one reads along with. That it checks its own work helps, but it is not the same as a check by someone who is accountable for the result. Self-monitoring and verification are two different things. The longer the autonomous run, the more expensive the one unchecked step at the end.
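A back-of-the-envelope sketch shows why run length matters. It rests on a strong simplifying assumption that is not from OpenAI: every decision carries the same small chance of an unnoticed error, independently of the others. The function name and the example rate are hypothetical.

```python
def chance_of_unnoticed_error(steps: int, per_step_error_rate: float) -> float:
    """Probability that at least one of `steps` decisions goes wrong.

    Assumes every step fails independently with the same probability,
    which is an illustrative simplification, not a measured property.
    """
    if steps < 0:
        raise ValueError("steps must be non-negative")
    if not 0.0 <= per_step_error_rate <= 1.0:
        raise ValueError("per_step_error_rate must be between 0 and 1")
    # Chance that all steps succeed, subtracted from certainty.
    return 1.0 - (1.0 - per_step_error_rate) ** steps


# Hypothetical numbers: a 1 percent error rate per decision.
for steps in (10, 50, 100):
    print(steps, round(chance_of_unnoticed_error(steps, 0.01), 3))
```

Under these assumptions, a 1 percent error rate per decision already gives roughly a 39 percent chance of at least one error over 50 decisions. The exact figures matter less than the shape: the risk grows with every unreviewed step.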

Reliability lives in the architecture

Reliability here does not come from the model, but from what you build around it. Define where a human signs off: before the merge, before the send, before the booking. Let the model run the long, tedious stretch, and pull the decisions with consequences back out. That split is the core of agentic work.
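One way to picture that split is a gate in the loop that runs the agent's steps. The sketch below is an assumption-laden illustration, not an OpenAI or Codex API: `Step`, `run_with_signoff`, and the action names are all hypothetical, and `execute` and `approve` stand in for whatever your own tooling provides.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical action names, taken from the examples in the text:
# before the merge, before the send, before the booking.
CONSEQUENTIAL_ACTIONS = frozenset({"merge", "send", "book"})


@dataclass(frozen=True)
class Step:
    action: str
    description: str


def run_with_signoff(
    steps: Iterable[Step],
    execute: Callable[[Step], None],
    approve: Callable[[Step], bool],
) -> List[str]:
    """Run steps in order, asking a human before consequential ones.

    Routine steps run unattended. A consequential step runs only if
    `approve` returns True; otherwise the run stops at that step.
    """
    log: List[str] = []
    for step in steps:
        if step.action in CONSEQUENTIAL_ACTIONS and not approve(step):
            log.append(f"blocked: {step.action}")
            break  # later steps may depend on the rejected one
        execute(step)
        log.append(f"done: {step.action}")
    return log


# Usage with stand-in callbacks: the reviewer rejects the merge.
plan = [
    Step("edit", "refactor the parser"),
    Step("test", "run the unit tests"),
    Step("merge", "merge into main"),
]
executed: List[str] = []
log = run_with_signoff(
    plan,
    execute=lambda step: executed.append(step.action),
    approve=lambda step: False,
)
print(log)  # prints ['done: edit', 'done: test', 'blocked: merge']
```

The design choice here is that the list of consequential actions lives outside the model. The agent can plan and run the tedious stretch, but it cannot decide on its own that a merge is routine.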

More capability, more control

OpenAI rates GPT-5.5 as High in biology and chemistry and in cybersecurity under its own Preparedness Framework, and ships what it calls its strongest set of safeguards to date, tested with nearly 200 early partners. For cyber that means stricter classifiers, which may block legitimate requests at first. That is the right direction. If you expand capability, you have to expand control with it.

Where GPT-5.5 earns its place

The progress is real, especially in coding and in efficiency. Use it. Hand the model the long, tedious tasks. But keep your hand on the points where an unnoticed step costs money, code, or trust. More autonomy does not mean less checking; it means more deliberate checking.

Sources