GitHub experienced four significant service disruptions in March 2026, affecting core services like github.com, API, Actions, and Copilot. The incidents ranged from caching issues to Redis misconfigurations and authentication problems, with GitHub committing to immediate fixes and long-term architectural improvements.
GitHub has published its March 2026 availability report, detailing four significant incidents that impacted various services across the platform. The report, authored by Jakub Oleksy, provides transparency into the root causes, impacts, and mitigation strategies for each incident, along with GitHub's commitment to improving system resilience.
Incident 1: Caching Mechanism Failure (March 3, 2026)
The first incident occurred on March 3, lasting 1 hour and 10 minutes, affecting github.com, the GitHub API, GitHub Actions, Git operations, and GitHub Copilot. At the peak of the incident:
- GitHub.com request failures reached approximately 40%
- GitHub API requests failed at a rate of 43%
- Git operations over HTTP experienced a 6% error rate
- GitHub Copilot requests failed at approximately 21%
- GitHub Actions saw less than 1% impact
The root cause was traced to a bug in a deployment intended to reduce writes to the user settings caching mechanism. The bug caused every user's cache entry to expire simultaneously, triggering a thundering herd of recalculations and rewrites that overwhelmed the system. This incident resembled an earlier February incident involving the same caching mechanism.
Immediate Actions Taken:
- Rollback of the faulty deployment
- Addition of a killswitch and improved monitoring to the caching mechanism
- Migration of the cache mechanism to a dedicated host to isolate future issues
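The simultaneous-expiry failure mode behind this incident is a classic cache stampede. One common defense is to randomize each entry's TTL so expirations spread out over time rather than landing at the same instant. The sketch below illustrates that idea with a hypothetical in-memory cache; the report does not describe GitHub's actual cache implementation, and all names and values here are illustrative.

```python
import random
import time

CACHE_TTL_SECONDS = 3600  # base TTL (illustrative value, not from the report)
JITTER_FRACTION = 0.2     # spread expirations over +/-20% of the TTL

# Hypothetical in-memory cache: key -> (expiry timestamp, value)
_cache: dict[str, tuple[float, object]] = {}

def cache_set(key: str, value: object) -> None:
    # Random jitter on each entry's TTL prevents every key from
    # expiring at the same instant, the stampede pattern behind
    # the March 3 incident.
    jitter = random.uniform(-JITTER_FRACTION, JITTER_FRACTION)
    expires_at = time.monotonic() + CACHE_TTL_SECONDS * (1 + jitter)
    _cache[key] = (expires_at, value)

def cache_get(key: str):
    # Return the cached value, or None if missing or expired.
    entry = _cache.get(key)
    if entry is None:
        return None
    expires_at, value = entry
    if time.monotonic() >= expires_at:
        del _cache[key]
        return None
    return value
```

With jittered TTLs, even a bulk rewrite of all entries produces staggered expirations, so recalculation load is spread over a window instead of arriving as one spike.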
Incident 2: Redis Infrastructure Misconfiguration (March 5, 2026)
On March 5, GitHub Actions experienced a 2-hour-and-55-minute degradation. The incident was triggered by Redis infrastructure updates being rolled out to improve resiliency. These updates introduced incorrect configuration changes to the Redis load balancer, causing internal traffic to be routed to incorrect hosts.
During this incident:
- 95% of workflow runs failed to start within 5 minutes, with an average delay of 30 minutes
- 10% of workflow runs failed with infrastructure errors
The mitigation involved correcting the misconfigured load balancer. Actions jobs began running successfully at 17:24 UTC, with the remaining time spent clearing the job queue backlog.
Immediate Actions Taken:
- Rollback of the Redis updates
- A freeze on all changes in the affected area until follow-up work is complete
- Improvements to automation to prevent incorrect configuration changes from propagating
- Enhanced alerting for misconfigured load balancers
- Updates to Redis client configuration in Actions for better resiliency
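The automation improvement above can be pictured as a pre-deployment gate that rejects a load balancer configuration whose backends fall outside the expected Redis host pool, the class of error behind this incident. This is a sketch under assumed host and listener names; the report does not describe GitHub's actual tooling.

```python
# Illustrative host pool; the real names are not in the report.
EXPECTED_POOL = {"redis-actions-01", "redis-actions-02", "redis-actions-03"}

def validate_lb_config(config: dict[str, list[str]]) -> list[str]:
    """Check a listener -> backend-hosts mapping before it is applied.

    Returns a list of human-readable errors; an empty list means the
    configuration is safe to propagate.
    """
    errors = []
    for listener, backends in config.items():
        if not backends:
            errors.append(f"{listener}: no backends configured")
        for host in backends:
            if host not in EXPECTED_POOL:
                errors.append(
                    f"{listener}: backend {host!r} is not in the expected pool"
                )
    return errors
```

Wiring a check like this into the rollout pipeline turns a traffic-routing outage into a rejected deployment.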
Incident 3: Copilot Coding Agent Authentication Issues (March 19-20, 2026)
The Copilot Coding Agent service experienced two separate degradations caused by the same underlying authentication issue. The first incident occurred on March 19 between 01:05 and 02:52 UTC, and the second on March 20 between 00:42 and 01:58 UTC.
During the first incident:
- Average error rate was approximately 53%
- Peak error rate reached approximately 93% of requests
During the second incident:
- Average error rate was approximately 99%
- Peak error rate reached approximately 100% of requests
- Significant retry amplification occurred
Both incidents were caused by a system authentication issue that prevented the service from connecting to its backing datastore. The mitigation for each incident involved rotating the affected credentials, which restored connectivity.
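The retry amplification observed in the second degradation is typically contained on the client side with a bounded retry budget and capped exponential backoff plus jitter. The sketch below shows that general pattern; it is not GitHub's actual client code, and the parameters are illustrative.

```python
import random
import time

def call_with_backoff(op, max_attempts=5, base_delay=0.1, cap=5.0):
    """Retry `op` with capped exponential backoff and full jitter.

    Bounding the attempt count and jittering the delays keeps a fleet
    of clients from amplifying an outage with synchronized retries.
    """
    for attempt in range(max_attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted; surface the failure
            delay = min(cap, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))  # full jitter
```

During a total outage like the 100% peak error rate here, the cap and the attempt limit bound how much extra load each client adds.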
Immediate Actions Taken:
- Implementation of automated monitoring for credential lifecycle events
- Improvements to operational processes to reduce time to detection and mitigation
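The credential-lifecycle monitoring above can be sketched as a periodic check that flags credentials approaching expiry, so rotation happens before connectivity is lost rather than during an incident. The credential names and alert window below are illustrative assumptions, not details from the report.

```python
import datetime

# How far ahead of expiry to alert (illustrative value).
ALERT_WINDOW = datetime.timedelta(days=14)

def credentials_needing_rotation(
    creds: dict[str, datetime.datetime],
    now: datetime.datetime,
) -> list[str]:
    """Return the names of credentials expiring within the alert window.

    `creds` maps a credential name to its expiry time; a scheduler
    would run this check periodically and page on a non-empty result.
    """
    return sorted(
        name for name, expires_at in creds.items()
        if expires_at - now <= ALERT_WINDOW
    )
```

A check like this shifts credential expiry from a reactive mitigation (rotating after the service loses its datastore) to a routine maintenance task.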
Incident 4: Microsoft Teams Integration Outage (March 24, 2026)
The final incident on March 24 affected the Microsoft Teams Integration and Teams Copilot Integration services. The 2-hour-and-52-minute outage prevented GitHub event notifications from being delivered to Microsoft Teams.
During this incident:
- Average error rate was 37.4%
- Peak error rate reached 90.1% of requests
- Approximately 19% of all integration installs failed to receive notifications
The root cause was identified as an outage at one of GitHub's upstream dependencies, which caused HTTP 500 errors and connection resets for the Teams integration.
Immediate Actions Taken:
- Coordination with relevant service teams
- Updates to observability and runbooks to reduce time to mitigation for similar issues
GitHub's Commitment to Improvement
In the report, GitHub acknowledges that while substantial investments have been made in building and operating the platform to improve resilience, more work remains to be done. The company emphasizes that achieving greater reliability requires both deep architectural work that is already underway and urgent, targeted improvements.
GitHub has committed to several immediate steps across all incidents, including enhanced monitoring, improved automation, better credential management, and updated operational procedures. The company also encourages users to follow their status page for real-time updates and post-incident recaps.
This transparency in reporting demonstrates GitHub's commitment to open communication with its developer community, even when incidents occur. By providing detailed root cause analyses and clear action plans, GitHub aims to maintain trust while working toward a more resilient platform.
For developers relying on GitHub services, this report serves as both a status update and a roadmap for expected improvements in system reliability and incident response times.