April 24, 2026 · 10 min read

Anthropic's Post-Mortem: 3 Bugs That Made Claude Dumber

Tags: claude-ai, anthropic, claude-code, performance, post-mortem

Introduction

For weeks, Claude users felt like something was off. Responses seemed shallower, code suggestions felt lazier, and sessions that used to flow smoothly started breaking down. The complaints piled up across Reddit, GitHub, and X — and on April 23, 2026, Anthropic finally confirmed what the community had been saying all along.

In a detailed engineering post-mortem, Anthropic identified three separate bugs introduced between March and April that collectively degraded the experience for users of Claude Code, the Claude Agent SDK, and Claude Cowork. The underlying Claude models themselves were not the problem. The bugs lived in the infrastructure and configuration layers surrounding the models.

This article breaks down exactly what went wrong, why it took so long to diagnose, what Anthropic has done to fix it, and what Claude users should know going forward.

The Timeline: How It All Unfolded

The story begins in early March 2026, when Anthropic made what seemed like a reasonable optimization decision. By early April, users were vocal about quality drops. By mid-April, the complaints had reached a crescendo, with an AMD senior director publicly stating on GitHub that Claude had "regressed to the point it cannot be trusted to perform complex engineering." VentureBeat, The Register, Fortune, and Axios all ran stories about the controversy.

Anthropic initially responded cautiously, acknowledging the reports and saying they were investigating. On April 23, the company published its full post-mortem, titled "An update on recent Claude Code quality reports," laying out the three distinct issues in chronological order.

The transparency was notable. Rather than vague corporate language, Anthropic named specific dates, described the technical root causes, and acknowledged that their internal testing had failed to catch the regressions before users did.

Bug One: The Reasoning Effort Downgrade (March 4)

The first change happened on March 4, when Anthropic lowered Claude Code's default reasoning effort from "high" to "medium." The motivation was legitimate — some users experienced latency so severe that the UI appeared frozen, particularly during complex multi-file operations. Reducing the effort level made responses faster.

The problem was that this also made responses noticeably worse. Reasoning effort in Claude's architecture directly controls how much computational budget the model spends thinking through a problem before responding. When you move from high to medium, the model takes fewer reasoning steps, considers fewer edge cases, and produces shallower analysis.

For simple tasks like renaming a variable or writing a quick utility function, the difference was negligible. But for the kinds of tasks that Claude Code's most active users rely on — complex refactors, architectural decisions, debugging subtle race conditions — the drop in quality was immediately felt.

Anthropic reversed this change on April 7 and went further. The new defaults set reasoning effort to "xhigh" for Opus 4.7 and "high" for all other models. This means Claude Code is now configured to think harder than it did even before the March regression, which should translate to meaningfully better output on complex tasks.
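To make the mechanics concrete, here is a minimal sketch of how per-model reasoning-effort defaults might map to a thinking budget. The effort names mirror the post-mortem ("medium", "high", "xhigh"); the token budgets, model IDs, and function names are illustrative assumptions, not Anthropic's actual implementation.

```python
# Hypothetical mapping from effort level to a reasoning-token budget.
# Budget values and model IDs are invented for illustration.
EFFORT_BUDGETS = {
    "medium": 4_000,   # fewer reasoning tokens: faster, shallower analysis
    "high": 16_000,
    "xhigh": 64_000,   # new default for Opus 4.7 after the April 7 fix
}

def default_effort(model: str) -> str:
    """Return the post-fix default effort level for a given model."""
    return "xhigh" if model == "opus-4.7" else "high"

def thinking_budget(model: str) -> int:
    """Map a model's default effort to its reasoning-token budget."""
    return EFFORT_BUDGETS[default_effort(model)]
```

Under this sketch, the March 4 change amounted to swapping the default from "high" to "medium", quietly cutting the reasoning budget for every session that did not override it.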

Bug Two: The Caching Error That Made Claude Forget (March 26)

The second bug was more insidious because its effects were harder to pin down. On March 26, Anthropic shipped an optimization designed to help users who returned to Claude Code after long breaks. The idea was straightforward: if a session had been idle for over an hour, clear out the older reasoning traces from the cache to reduce latency when the user resumed.

In theory, this was a good tradeoff. Stale reasoning data from an hour ago is less useful than fresh context, and clearing it would make the session feel snappier when you picked it back up. In practice, a bug in the implementation caused the cache-clearing to trigger not just once when the session resumed, but on every subsequent turn for the rest of the session.
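This class of bug is easy to reproduce in miniature: a resume-time flag gets set but never consumed, so a clear that should fire once fires on every turn. All names in this sketch are invented for illustration; it models the failure mode described in the post-mortem, not Anthropic's actual code.

```python
# Illustrative reproduction of the bug class: an idle-resume cache clear
# that was meant to fire once, but fires on every subsequent turn.
IDLE_THRESHOLD = 3600  # seconds of inactivity that count as "idle"

class Session:
    def __init__(self):
        self.cache = []            # prior reasoning traces
        self.last_turn_at = 0.0
        self.resumed_after_idle = False

    def on_turn(self, now: float, buggy: bool = False):
        idle = (now - self.last_turn_at) > IDLE_THRESHOLD
        if buggy:
            # Bug: the flag is set on resume but never reset, so the
            # clear condition stays true for the rest of the session.
            if idle:
                self.resumed_after_idle = True
            if self.resumed_after_idle:
                self.cache.clear()
        else:
            # Fix: clear exactly once, on the turn that resumes the session.
            if idle:
                self.cache.clear()
        self.last_turn_at = now
        self.cache.append(f"trace@{now}")
```

In the buggy path, the cache never holds more than the current turn's trace, which is exactly the "goldfish" behavior users described: each turn arrives with the session's recent reasoning already wiped.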

The practical impact was devastating for long coding sessions. Claude would effectively lose track of decisions it had made just minutes earlier. Users reported that Claude would suggest changes, then in the very next turn seem unaware of what it had just done. It would repeat itself, contradict its own recent suggestions, or ask questions that the conversation had already resolved.

This bug specifically affected Sonnet 4.6 and Opus 4.6 users and was fixed on April 10. If you were using Claude Code during this window and felt like you were dealing with a goldfish rather than an AI assistant, this was almost certainly why.

Bug Three: The System Prompt That Backfired (April 16)

The third issue was the most recent and arguably the most avoidable. On April 16, Anthropic updated the system prompt used by Claude Code with instructions designed to reduce verbosity. The goal was to make Claude more concise — less preamble, fewer unnecessary explanations, more direct answers.

Verbosity reduction is a common request from power users. Nobody wants three paragraphs of context before a one-line code fix. But the way this particular prompt change interacted with other existing prompt instructions created an unexpected side effect: Claude started cutting corners on its actual reasoning, not just its communication style.

Instead of producing the same quality output with fewer words, Claude produced lower quality output with fewer words. The instruction to be less verbose apparently interfered with the chain-of-thought patterns that help the model work through complex problems. Anthropic's subsequent ablation tests — where they tested each prompt change in isolation and in combination — revealed a measurable three percent performance drop across both Opus 4.6 and Opus 4.7.
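The ablation methodology is worth sketching, since it is how the three percent drop was isolated. The idea: score every combination of prompt changes, then attribute a delta to each change by comparing scores with and without it. The change names and the stub benchmark below are invented; a real run would execute a coding benchmark suite per configuration.

```python
# Minimal ablation-testing sketch: evaluate every subset of prompt changes
# so a regression can be attributed to one change (or an interaction).
from itertools import combinations

CHANGES = ["conciseness_instruction", "tool_hint_update"]  # hypothetical

def run_benchmark(active: frozenset) -> float:
    """Stub benchmark: pretend the conciseness change costs 3 points."""
    base = 70.0
    return base - (3.0 if "conciseness_instruction" in active else 0.0)

def ablate(changes):
    """Score every combination of changes, from none to all."""
    results = {}
    for r in range(len(changes) + 1):
        for combo in combinations(changes, r):
            results[frozenset(combo)] = run_benchmark(frozenset(combo))
    return results

scores = ablate(CHANGES)
# The cost of one change = score without it minus score with it,
# holding the other changes fixed.
delta = scores[frozenset()] - scores[frozenset({"conciseness_instruction"})]
```

Testing changes only in combination would have shown *that* quality dropped; testing them in isolation as well is what reveals *which* change is responsible.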

Three percent might not sound like much, but for users who work at the boundary of what Claude can handle, it was the difference between success and frustration. Combined with the two earlier bugs that had already eroded trust, this third issue was the final straw, pushing the community outcry into mainstream tech press coverage.

This change was reverted on April 20, and Anthropic has indicated that future system prompt modifications will undergo more rigorous ablation testing before deployment.

Why It Took So Long to Diagnose

One of the most frustrating aspects of this saga was the timeline. The first change happened on March 4, but the post-mortem did not arrive until April 23 — nearly seven weeks later. Users had been reporting quality drops for most of that period, and Anthropic's initial responses were cautious and noncommittal.

There are understandable reasons for this. AI quality regressions are notoriously difficult to diagnose because they are inherently subjective. When a traditional software product breaks, there is usually a clear error — a crash, a wrong number, a missing feature. When an AI model gets "worse," the evidence is anecdotal, context-dependent, and hard to reproduce.

Anthropic had to distinguish between several possible explanations: confirmation bias among users who expected degradation after model updates, genuine regressions in model capability, infrastructure changes affecting performance, and normal variance in outputs that users were misattributing to systemic issues. The fact that three separate bugs were contributing simultaneously made the diagnostic process even harder, because fixing one issue did not eliminate all complaints.

Additionally, these bugs affected different user populations differently. The reasoning effort change primarily impacted users doing complex tasks. The caching bug hit users with longer sessions on specific models. The system prompt change affected everyone but was only measurable through careful testing. No single user's experience reflected the full picture.

What Anthropic Did to Make It Right

Beyond fixing the three bugs, Anthropic took several additional steps. All three issues were resolved as of the April 20 release (v2.1.116 of Claude Code). On April 23, alongside the post-mortem publication, Anthropic reset usage limits for all subscribers — effectively giving everyone a fresh allocation to compensate for the degraded experience during the affected period.

The company also committed to process improvements. Future system prompt changes will undergo ablation testing before deployment, comparing model performance with and without each change across a battery of coding benchmarks. Infrastructure changes that affect reasoning parameters will include more comprehensive regression testing.

Perhaps most importantly, Anthropic published a genuinely detailed post-mortem rather than a vague apology. The engineering blog post named specific dates, described exact technical causes, and did not shy away from acknowledging that their testing processes had failed. This level of transparency is unusual in the AI industry and sets a precedent that users will likely hold them to in the future.

What This Means for Claude Users Going Forward

If you stepped away from Claude Code during the worst of the regression, now is a good time to come back. With reasoning effort defaults set higher than ever before, the caching bug resolved, and the problematic system prompt reverted, Claude Code should be performing at its best.

There are a few practical takeaways worth keeping in mind. First, if you are on an older version of Claude Code, update to v2.1.116 or later to ensure you have all three fixes. Second, if you experienced unusual usage limit drain during March or April, check whether your limits have been reset — Anthropic rolled this out to all subscribers on April 23.
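For the first takeaway, checking whether an install predates the patched release is a simple version comparison. The sketch below uses a sample version string so the logic runs as-is; in practice you would substitute the output of `claude --version` (whose exact format may differ).

```shell
# Check whether a Claude Code install predates the fully patched release.
installed="2.1.104"    # in practice: installed="$(claude --version)"
required="2.1.116"     # first release containing all three fixes
# sort -V orders version strings numerically; if the smaller of the two
# is $required, the install is already at or past the patched release.
lowest=$(printf '%s\n%s\n' "$installed" "$required" | sort -V | head -n1)
if [ "$lowest" = "$required" ]; then
  echo "Claude Code is up to date"
else
  echo "update needed (on $installed, want >= $required)"
fi
```

Plain string comparison would get versions like 2.1.9 vs 2.1.116 wrong, which is why `sort -V` (version sort) does the ordering here.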

Third, and more broadly, this episode highlights the importance of the layers between you and the model. The Claude models themselves — Opus 4.6, Opus 4.7, Sonnet 4.6 — were not degraded. The bugs were in the configuration, caching, and prompting infrastructure. This distinction matters because it means the model capabilities you rely on are stable, but the tools and interfaces that sit on top of them can introduce quality variations that feel like model-level changes.

For API users, this is less of a concern since you control your own system prompts and reasoning parameters. But for users of Claude Code, Cowork, and the Agent SDK, your experience is shaped by Anthropic's configuration choices as much as by the underlying model.
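To illustrate why API users were insulated, here is a sketch of a request payload where the caller, not the client app, pins the system prompt and reasoning budget. The shape follows Anthropic's Messages API with extended thinking as publicly documented, but the model ID and budget values are examples, and actually sending it would require the `anthropic` SDK and an API key; here we only construct the payload.

```python
# Sketch of an API-side request where you, not the client app, choose
# the system prompt and the reasoning budget. Values are illustrative.
request = {
    "model": "claude-opus-4-6",          # illustrative model ID
    "max_tokens": 20_000,                # must exceed the thinking budget
    "system": "You are a concise senior engineer. Think carefully.",
    "thinking": {"type": "enabled", "budget_tokens": 16_000},
    "messages": [
        {"role": "user",
         "content": "Review this refactor for race conditions."}
    ],
}
```

Because these fields are explicit in every request, a downstream default change like the March 4 effort downgrade cannot silently alter them.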

Lessons for the AI Industry

This incident offers broader lessons that extend beyond Anthropic. As AI products become more complex, with multiple layers of prompting, caching, and configuration sitting between the model and the user, the surface area for subtle regressions grows. Traditional software testing catches crashes and errors. AI quality testing needs to catch degradations that might only show up in the long tail of complex use cases.

Anthropic's experience also highlights the tension between optimization and quality. All three changes were motivated by legitimate goals — reducing latency, managing session overhead, improving conciseness. The problem was not bad intentions but insufficient testing of how these changes interacted with real-world usage patterns.

The community response also demonstrated something important: sophisticated users can detect quality regressions faster and more reliably than automated benchmarks. The users who flagged these issues on Reddit and GitHub were running Claude through demanding, real-world tasks that exposed weaknesses no synthetic benchmark would catch. Companies building AI products would be wise to treat user complaints about quality as high-priority signals rather than subjective noise.

Conclusion

Anthropic's April 23 post-mortem marks a turning point in how AI companies communicate about quality issues. Three bugs, introduced over seven weeks, collectively degraded the Claude Code experience for thousands of users. The fixes are in place, limits have been reset, and Anthropic has committed to more rigorous testing.

For users, the key action is simple: update Claude Code and give it another shot. The reasoning effort defaults are now higher than they have ever been, and the infrastructure bugs that caused forgetfulness and shallow responses are resolved.

If you want to keep a close eye on how Claude performs for you day to day — tracking your usage patterns, monitoring which models you are hitting, and making sure you are getting the most out of your subscription — Gaugr gives you real-time visibility into your Claude consumption across all models and tiers.