Independent performance tracking shows that Claude Code Opus 4.5 has experienced statistically significant degradation over the past 30 days, with pass rates dropping from 58% to 54% on SWE-Bench-Pro benchmarks.
The degradation was flagged by an independent tracker maintained by Marginlab, which evaluates the model daily against a curated subset of SWE-Bench-Pro software engineering tasks.
{{IMAGE:1}}
The daily performance tracking reveals a concerning trend. Yesterday's pass rate dropped eight percentage points to 50%, but that change was not statistically significant given the small sample size of 50 trials. The 30-day aggregated data tells a different story: with 655 evaluations over the past month, the decline from 58% to 54% crosses the statistical significance threshold, indicating a real performance regression rather than random variation.
Methodology and Context
The tracker operates as an independent third party with no affiliation to frontier model providers. This independence is crucial given Anthropic's own September 2025 postmortem on Claude degradations, which highlighted the importance of transparent performance monitoring.
Key aspects of the methodology include:
- Daily benchmarks on 50 test instances from a contamination-resistant subset of SWE-Bench-Pro
- Direct benchmarking within Claude Code CLI using the latest Opus 4.5 model
- No custom harnesses, ensuring results reflect actual user experience
- Statistical analysis modeling each trial as a Bernoulli random variable, with 95% confidence intervals (see the sketch after this list)
- Detection of degradation from both model changes and harness modifications
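As a rough illustration of how those confidence intervals translate into detection thresholds, the sketch below computes the 95% half-width for a pass rate estimated from n independent Bernoulli trials, using the normal approximation. The 58% baseline and the sample sizes are taken from the figures reported in this article; the tracker's exact procedure is not published here, so treat this as an approximation rather than Marginlab's actual code.

```python
import math

def ci_half_width(pass_rate: float, n_trials: int, z: float = 1.96) -> float:
    """95% confidence half-width for a pass rate estimated from
    n_trials independent Bernoulli trials (normal approximation)."""
    return z * math.sqrt(pass_rate * (1.0 - pass_rate) / n_trials)

baseline = 0.58  # baseline pass rate reported by the tracker

# Sample sizes for the daily, weekly, and monthly windows quoted in this article.
for label, n in [("1D", 50), ("7D", 250), ("30D", 655)]:
    hw = ci_half_width(baseline, n) * 100
    print(f"{label}: a change of roughly +/-{hw:.1f} percentage points "
          "is needed to stand out from sampling noise")
```

The half-widths this produces (about ±13.7, ±6.1, and ±3.8 points) are in the same ballpark as the ±14.0%, ±5.6%, and ±3.4% thresholds quoted below, though not identical, so the tracker presumably applies some refinement of this basic approach.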
Performance Trends
The data shows varying levels of statistical significance across different time horizons:
- Daily (1D): Yesterday's 50% pass rate is an 8-point drop from the baseline, but with only 50 trials a change of ±14.0 percentage points would be needed for statistical significance
- Weekly (7D): The past week's 53% pass rate is a 4.8-point decline, against a ±5.6-point threshold for significance with 250 trials
- Monthly (30D): The 54% pass rate over 655 evaluations is a 4.1-point decline, which is statistically significant given the ±3.4-point threshold
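To frame the same comparison as a hypothesis test, here is a hedged sketch of a one-proportion z-test against the 58% baseline, a conventional way to ask whether an observed decline is distinguishable from sampling noise. The pass rates and trial counts are the rounded figures listed above, and whether Marginlab uses this exact test is an assumption.

```python
import math

def two_sided_p_value(p0: float, p_hat: float, n: int) -> float:
    """Two-sided p-value for observing pass rate p_hat over n trials
    when the true rate is p0 (normal approximation to the binomial)."""
    se = math.sqrt(p0 * (1.0 - p0) / n)
    z = (p_hat - p0) / se
    return math.erfc(abs(z) / math.sqrt(2.0))

baseline = 0.58
for label, rate, n in [("1D", 0.50, 50), ("7D", 0.53, 250), ("30D", 0.54, 655)]:
    p = two_sided_p_value(baseline, rate, n)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{label}: {rate:.0%} over {n} trials -> p = {p:.3f} ({verdict})")
```

Run on those rounded figures, only the 30-day window falls below the 0.05 threshold, which matches the tracker's conclusion that the monthly decline is a real regression rather than noise.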
Implications for Users
This performance degradation matters because the tracker benchmarks directly in Claude Code without custom harnesses. This means the results reflect what actual users can expect when using the tool for real software engineering tasks.
The decline from 58% to 54% may seem modest, but in the context of AI performance tracking, even small percentage point changes can represent significant shifts in capability, especially for complex software engineering tasks where reliability is crucial.
Industry Context
The findings come at a time when AI model performance stability has become a critical concern in the industry. Anthropic's own postmortem on degradations highlighted how model performance can drift over time, making independent verification essential for users who rely on these tools for production work.
As AI coding assistants become increasingly integrated into software development workflows, performance tracking like this provides essential transparency for the developer community. The statistical rigor applied by Marginlab helps distinguish between normal variation and genuine performance issues that users should be aware of.