Anthropic published its model deprecation commitments earlier this year, and now we have the dates. Claude Sonnet 4 and Claude Opus 4 hit end-of-life on June 15, 2026. After that, API requests to those models return errors. Anthropic has committed to preserving model weights and even conducting "exit interviews" with retired models, which is genuinely thoughtful. It doesn't change the operational reality: if your product was built on those models, you have six weeks to migrate.

Three weeks before that deadline, OpenAI released GPT-5.5 with a one-million-token context window and pushed it into the API as the default for what they're calling "real computer work." Different vendor, same pattern. The model under your product is moving, whether you're ready or not.

This is the third or fourth model rotation in the last eighteen months. It's not slowing down. The interesting question isn't "should we upgrade?" The question is: what does our system do when the model under it changes, and how much does that hurt?

The migration tax compounds when you can't measure it

Most teams discover they have a migration tax during the migration. The new model behaves differently on edge cases. Prompts that worked degrade subtly. Tool calls that used to be reliable start producing schema-shaped-but-wrong output. Nobody notices until a user reports it.
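The "schema-shaped-but-wrong" failure is worth catching mechanically, because a schema validator alone will wave it through. A minimal sketch, with a hypothetical `schedule_meeting` tool and made-up value constraints, of checking values rather than just shape:

```python
# Sketch: catching "schema-shaped-but-wrong" tool calls. The tool name and
# the value constraints here are hypothetical, not any real API's rules.

def validate_tool_call(call: dict) -> list[str]:
    """Return a list of problems; empty means the call looks sane."""
    problems = []

    # Shape checks: the keys a schema validator would also catch.
    for key in ("name", "arguments"):
        if key not in call:
            problems.append(f"missing key: {key}")
    if problems:
        return problems

    # Semantic checks: values that are schema-valid but wrong in practice.
    args = call["arguments"]
    if call["name"] == "schedule_meeting":
        duration = args.get("duration_minutes")
        if not isinstance(duration, int) or not 0 < duration <= 480:
            problems.append("duration_minutes out of range")
        if args.get("attendees") == []:
            problems.append("empty attendee list")
    return problems


# Schema-shaped, semantically wrong: a zero-minute meeting with nobody in it.
bad = {"name": "schedule_meeting",
       "arguments": {"duration_minutes": 0, "attendees": []}}
```

Running these checks over eval outputs is what turns "a user reported it" into "the suite flagged it."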

Without an eval suite, every model swap becomes a debate. Is GPT-5.5 actually better for our workload? Did Claude Sonnet 4.5 introduce a regression in summarization? Nobody knows, because nobody can test it the same way twice. The team falls back to anecdote and vibes.

The teams that handle deprecation calmly all have one thing in common. They can swap models, run their evals, and have a quantitative answer in an afternoon. Not "it seems fine." A number, against a known set of inputs, with a delta versus the previous model.
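"A number, with a delta" can be as small as this. A sketch, with toy fixtures and stand-in callables where a real vendor client would go:

```python
# Minimal sketch of a portable eval run: score two models over the same
# fixtures and report the delta. The fixtures are toy examples, and the
# `old`/`new` lambdas are stand-ins for real model clients.

FIXTURES = [
    {"input": "2+2?", "expected": "4"},
    {"input": "Capital of France?", "expected": "Paris"},
]

def score(call_model, fixtures) -> float:
    """Fraction of fixtures where the model's answer exactly matches."""
    hits = sum(call_model(f["input"]).strip() == f["expected"] for f in fixtures)
    return hits / len(fixtures)

def compare(old_model, new_model, fixtures) -> dict:
    old, new = score(old_model, fixtures), score(new_model, fixtures)
    return {"old": old, "new": new, "delta": new - old}

# Fake models so the sketch runs without any vendor SDK:
old = lambda q: {"2+2?": "4", "Capital of France?": "Paris"}.get(q, "")
new = lambda q: {"2+2?": "4"}.get(q, "")
```

Exact match is the crudest possible grader; the point is the shape of the loop, not the metric.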

What portable evals actually look like

"Portable" is the keyword. A lot of teams have evals, but they're tied to a specific provider's tooling, or written in a way that bakes in model-specific quirks. Both fail at the migration moment.

The pattern that holds up has four traits:

1. Provider-agnostic. The eval harness talks to a thin model interface, not a vendor SDK, so swapping providers means writing one adapter, not rewriting the suite.

2. Model-agnostic grading. Inputs and scoring don't bake in one model's quirks or output formatting, so the same test means the same thing across models.

3. Versioned with the code. Evals live in the repo next to the prompts they guard and change in the same pull requests.

4. Cheap to run. Fast and inexpensive enough that running the full suite against a candidate model is an afternoon, not a project.

Treating evals as a first-class part of the codebase is the cheapest insurance you can buy against the next deprecation cycle.
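The provider-agnostic seam is the part worth seeing concretely. A sketch, assuming a tiny `Model` interface of our own invention; the adapter internals are placeholders, not real SDK calls:

```python
# Sketch of a provider-agnostic seam: the eval suite only ever sees `Model`,
# so a vendor swap means one new adapter, not a rewrite of the evals.
from typing import Protocol

class Model(Protocol):
    name: str
    def complete(self, prompt: str) -> str: ...

class AnthropicAdapter:
    name = "claude-sonnet-4"
    def complete(self, prompt: str) -> str:
        # Real code would call the Anthropic SDK here.
        return f"[{self.name}] {prompt}"

class OpenAIAdapter:
    name = "gpt-5.5"
    def complete(self, prompt: str) -> str:
        # Real code would call the OpenAI SDK here.
        return f"[{self.name}] {prompt}"

def run_eval(model: Model, prompts: list[str]) -> list[str]:
    """The suite depends only on the Protocol, never on a vendor type."""
    return [model.complete(p) for p in prompts]
```

`Protocol` gives you the seam without inheritance: any object with a `name` and a `complete` method works, which is exactly the property that survives a vendor change.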

The actual playbook for a model deprecation

1. Run the new model against your existing eval suite before you commit. If you don't have one, this is the moment to build the smallest possible version, even if it's twenty examples. Twenty real examples beat zero.

2. Stage by traffic percentage, not by date. "We'll switch over on June 15" is wishful. "We'll route 5% of production to the new model on June 1, watch the eval metrics for a week, then ramp" is operational.

3. Document the capability deltas, not just the quality scores. New models often gain something and lose something. Your team needs to know both, especially the loss side. That's where the user-visible regressions hide.

4. Plan the rollback path before you cut over. What does it take to get back to the old model? If the answer is "we can't, the deprecation is final," what's your fallback model? Have it tested. Have the routing config ready.

Why this matters for the work we care about

The line in our footer is Stephen Hawking's: intelligence is the ability to adapt to change. Model deprecation is the simplest test of whether your AI system is actually adaptive. If the answer is "we panic when the vendor sends a sunset email," you're not running an adaptive system. You're running a fragile one with extra steps.

We help teams build the eval and observability layer that makes deprecation routine instead of disruptive. Not because evals are exciting, but because they're the difference between a system that absorbs change and one that breaks the moment reality moves. That's the work that pays off six months from now, when the next model release lands and your team barely notices.

Sources: Anthropic model deprecation commitments and Claude 3 Opus update (anthropic.com); Claude Sonnet 4 and Opus 4 deprecation guidance (MindStudio migration guide); GPT-5.5 release coverage (TechCrunch, April 23, 2026); deprecation feed for ongoing tracking (deprecations.info).

Building the eval layer that makes the next model swap routine? We'd be happy to talk through what's working and what isn't. Thirty minutes, no pitch theater.

Book a discovery call