AI Weekly: The Week Models Crossed a Line

This Week in AI

This was a week of capability milestones — and one very deliberate restraint. Models crossed benchmarks that felt abstract six months ago, and one lab decided that “more capable” doesn’t always mean “ship it.” A lot to unpack.

The Big Story

Anthropic finished training Claude Mythos, described internally as its most capable model ever, and then declined to release it. During testing, Mythos independently identified thousands of zero-day vulnerabilities across major operating systems and browsers, triggering Anthropic’s ASL-4 safety protocol. No public release date has been given. It’s the first time a frontier lab has publicly acknowledged sitting on a finished model for safety reasons. Whether you find that reassuring or alarming probably says something about where you stand on this AI moment.

Also Worth Knowing

GPT-5.4 hits 75% on desktop task benchmarks. OpenAI’s new flagship is the first general-purpose model with native computer use: it can operate your desktop and run workflows across apps, and it has officially cleared the human-level bar on OSWorld-Verified. The agentic era isn’t coming. It’s here.
Claude Opus 4.7 is out. Launched April 16, it’s now the default Opus across Claude products, the API, Bedrock, and Vertex. Anthropic calls it the most practical upgrade since Claude 4. Worth testing if you’re running anything in production on 4.6.
Claude Design launched as a standalone product. Announced April 17, it takes text prompts and outputs slide decks, one-pagers, UI mockups, and dashboards — exportable to PDF, PPTX, Canva, or HTML. Early days, but this is clearly aimed at the “I just need to make a thing” professional.
PwC: 20% of companies are capturing 75% of AI’s gains. The companies winning with AI aren’t simply using more tools. They’re using AI to chase new revenue, not just to cut costs. The gap between leaders and everyone else is widening fast.

What I’m Watching

The Mythos story will develop. Other labs will face this same decision eventually — and they won’t all make the same call. Keep an eye on how the industry responds when a competitor openly sits on capability for safety reasons.