GitHub experienced four significant service disruptions in March 2026, affecting core services like github.com, API, Actions, and Copilot. The incidents ranged from caching issues to Redis misconfigurations and authentication problems, with GitHub committing to immediate fixes and long-term architectural improvements.
GitHub has published its March 2026 availability report, detailing four significant incidents that impacted various services across the platform. The report, authored by Jakub Oleksy, provides transparency into the root causes, impacts, and mitigation strategies for each incident, along with GitHub's commitment to improving system resilience.
Incident 1: Caching Mechanism Failure (March 3, 2026)
The first incident occurred on March 3, lasting 1 hour and 10 minutes, affecting github.com, the GitHub API, GitHub Actions, Git operations, and GitHub Copilot. At the peak of the incident:
- GitHub.com request failures reached approximately 40%
- GitHub API requests failed at a rate of 43%
- Git operations over HTTP experienced a 6% error rate
- GitHub Copilot requests failed at approximately 21%
- GitHub Actions saw less than 1% impact
The root cause was traced to a bug in a deployment intended to reduce writes to the user settings caching mechanism. The bug caused every user's cache entry to expire simultaneously, triggering a thundering herd of recalculations and rewrites that overwhelmed the system. This incident resembled an earlier February incident involving the same caching mechanism.
Immediate Actions Taken:
- Rollback of the faulty deployment
- Addition of a killswitch and improved monitoring to the caching mechanism
- Migration of the cache mechanism to a dedicated host to isolate future issues
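The simultaneous-expiry failure mode behind this incident is a classic cache stampede. One common defense is to randomize each entry's TTL so expirations spread out over time rather than landing at the same instant. The sketch below illustrates that idea with a hypothetical in-memory cache; the report does not describe GitHub's actual cache implementation, and all names and values here are illustrative.

```python
import random
import time

CACHE_TTL_SECONDS = 3600  # base TTL (illustrative value, not from the report)
JITTER_FRACTION = 0.2     # spread expirations over +/-20% of the TTL

# Hypothetical in-memory cache: key -> (expiry timestamp, value)
_cache: dict[str, tuple[float, object]] = {}

def cache_set(key: str, value: object) -> None:
    # Random jitter on each entry's TTL prevents every key from
    # expiring at the same instant, the stampede pattern behind
    # the March 3 incident.
    jitter = random.uniform(-JITTER_FRACTION, JITTER_FRACTION)
    expires_at = time.monotonic() + CACHE_TTL_SECONDS * (1 + jitter)
    _cache[key] = (expires_at, value)

def cache_get(key: str):
    # Return the cached value, or None if missing or expired.
    entry = _cache.get(key)
    if entry is None:
        return None
    expires_at, value = entry
    if time.monotonic() >= expires_at:
        del _cache[key]
        return None
    return value
```

With jittered TTLs, even a bulk rewrite of all entries produces staggered expirations, so recalculation load is spread over a window instead of arriving as one spike.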
Incident 2: Redis Infrastructure Misconfiguration (March 5, 2026)
On March 5, GitHub Actions experienced a 2-hour-and-55-minute degradation. The incident was triggered by Redis infrastructure updates being rolled out to improve resiliency. These updates introduced incorrect configuration changes to the Redis load balancer, causing internal traffic to be routed to incorrect hosts.
During this incident:
- 95% of workflow runs failed to start within 5 minutes, with an average delay of 30 minutes
- 10% of workflow runs failed with infrastructure errors
The mitigation involved correcting the misconfigured load balancer. Actions jobs began running successfully at 17:24 UTC, with the remaining time spent clearing the job queue backlog.
Immediate Actions Taken:
- Rollback of the Redis updates
- A freeze on all changes in the affected area until follow-up work is complete
- Improvements to automation to prevent incorrect configuration changes from propagating
- Enhanced alerting for misconfigured load balancers
- Updates to Redis client configuration in Actions for better resiliency
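The automation improvement above can be pictured as a pre-deployment gate that rejects a load balancer configuration whose backends fall outside the expected Redis host pool, the class of error behind this incident. This is a sketch under assumed host and listener names; the report does not describe GitHub's actual tooling.

```python
# Illustrative host pool; the real names are not in the report.
EXPECTED_POOL = {"redis-actions-01", "redis-actions-02", "redis-actions-03"}

def validate_lb_config(config: dict[str, list[str]]) -> list[str]:
    """Check a listener -> backend-hosts mapping before it is applied.

    Returns a list of human-readable errors; an empty list means the
    configuration is safe to propagate.
    """
    errors = []
    for listener, backends in config.items():
        if not backends:
            errors.append(f"{listener}: no backends configured")
        for host in backends:
            if host not in EXPECTED_POOL:
                errors.append(
                    f"{listener}: backend {host!r} is not in the expected pool"
                )
    return errors
```

Wiring a check like this into the rollout pipeline turns a traffic-routing outage into a rejected deployment.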
Incident 3: Copilot Coding Agent Authentication Issues (March 19-20, 2026)
The Copilot Coding Agent service experienced two separate degradations caused by the same underlying authentication issue. The first incident occurred on March 19 between 01:05 and 02:52 UTC, and the second on March 20 between 00:42 and 01:58 UTC.
During the first incident:
- Average error rate was approximately 53%
- Peak error rate reached approximately 93% of requests
During the second incident:
- Average error rate was approximately 99%
- Peak error rate reached approximately 100% of requests
- Significant retry amplification occurred
Both incidents were caused by a system authentication issue that prevented the service from connecting to its backing datastore. The mitigation for each incident involved rotating the affected credentials, which restored connectivity.
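The retry amplification observed in the second degradation is typically contained on the client side with a bounded retry budget and capped exponential backoff plus jitter. The sketch below shows that general pattern; it is not GitHub's actual client code, and the parameters are illustrative.

```python
import random
import time

def call_with_backoff(op, max_attempts=5, base_delay=0.1, cap=5.0):
    """Retry `op` with capped exponential backoff and full jitter.

    Bounding the attempt count and jittering the delays keeps a fleet
    of clients from amplifying an outage with synchronized retries.
    """
    for attempt in range(max_attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted; surface the failure
            delay = min(cap, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))  # full jitter
```

During a total outage like the 100% peak error rate here, the cap and the attempt limit bound how much extra load each client adds.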
Immediate Actions Taken:
- Implementation of automated monitoring for credential lifecycle events
- Improvements to operational processes to reduce time to detection and mitigation
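The credential-lifecycle monitoring above can be sketched as a periodic check that flags credentials approaching expiry, so rotation happens before connectivity is lost rather than during an incident. The credential names and alert window below are illustrative assumptions, not details from the report.

```python
import datetime

# How far ahead of expiry to alert (illustrative value).
ALERT_WINDOW = datetime.timedelta(days=14)

def credentials_needing_rotation(
    creds: dict[str, datetime.datetime],
    now: datetime.datetime,
) -> list[str]:
    """Return the names of credentials expiring within the alert window.

    `creds` maps a credential name to its expiry time; a scheduler
    would run this check periodically and page on a non-empty result.
    """
    return sorted(
        name for name, expires_at in creds.items()
        if expires_at - now <= ALERT_WINDOW
    )
```

A check like this shifts credential expiry from a reactive mitigation (rotating after the service loses its datastore) to a routine maintenance task.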
Incident 4: Microsoft Teams Integration Outage (March 24, 2026)
The final incident on March 24 affected the Microsoft Teams Integration and Teams Copilot Integration services. The 2-hour-and-52-minute outage prevented GitHub event notifications from being delivered to Microsoft Teams.
During this incident:
- Average error rate was 37.4%
- Peak error rate reached 90.1% of requests
- Approximately 19% of all integration installs failed to receive notifications
The root cause was identified as an outage at one of GitHub's upstream dependencies, which caused HTTP 500 errors and connection resets for the Teams integration.
Immediate Actions Taken:
- Coordination with relevant service teams
- Updates to observability and runbooks to reduce time to mitigation for similar issues
GitHub's Commitment to Improvement
In the report, GitHub acknowledges that while substantial investments have been made in building and operating the platform to improve resilience, more work remains to be done. The company emphasizes that achieving greater reliability requires both deep architectural work that is already underway and urgent, targeted improvements.
GitHub has committed to several immediate steps across all incidents, including enhanced monitoring, improved automation, better credential management, and updated operational procedures. The company also encourages users to follow their status page for real-time updates and post-incident recaps.
This transparency in reporting demonstrates GitHub's commitment to open communication with its developer community, even when incidents occur. By providing detailed root cause analyses and clear action plans, GitHub aims to maintain trust while working toward a more resilient platform.
For developers relying on GitHub services, this report serves as both a status update and a roadmap for expected improvements in system reliability and incident response times.