GitHub's May 2026 Availability Report: Nine Incidents and the Engineering Behind the Fixes

GitHub logged nine degradation incidents in May 2026, ranging from a 32-bit integer key exhausting its ID space to an automated account-review system suspending the very service account that authenticates Actions. The report doubles as a progress check on GitHub's larger migration off its monolith and onto Azure, and the root causes read like a field guide to the failure modes that haunt large-scale systems.

GitHub published its May 2026 availability report on June 11, covering nine separate incidents that degraded performance across the platform. Read individually, each is a contained outage with a clear cause and a fix. Read together, they form a useful catalog of the failure modes that show up when a system grows faster than its original architecture assumed it would.

The framing matters here. GitHub opens the report not with the incidents but with infrastructure progress, and the numbers explain why. Traffic is climbing fast, pushed by AI-assisted and agentic development workflows, and the company is in the middle of restructuring how the platform runs. The stated priority order is blunt: availability, then capacity, then features. Most of May's incidents trace back to the seams created by that work in progress.

The migration context

GitHub reports it is now serving 40% of monolith traffic from Azure, up from 8% in February. The Azure share of Git traffic sits at 30%, and repository replication is at 99%. The company says it has more than doubled effective capacity in four months.

The more interesting structural change is the decomposition of the primary database cluster. GitHub is splitting users, authentication, and authorization into independent domains so a problem in one cannot cascade into the others. The new users service has fully cut over and is reportedly handling double the traffic at lower database cost. Stateless authentication tokens are also rolling out, which removes a per-request database lookup that previously amplified load during traffic spikes.
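To make the stateless-token point concrete, here is a minimal sketch, assuming an HMAC-signed token with an embedded expiry. The token layout, the `SECRET` key, and the function names are illustrative assumptions, not GitHub's actual token format. The only thing it is meant to show is that verification needs a key and a clock rather than a database row.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

SECRET = b"demo-signing-key"  # hypothetical; a real key would live in a secret store


def issue_token(user_id: int, ttl_seconds: int, now: Optional[float] = None) -> str:
    """Pack the claims and their HMAC signature into one self-contained string."""
    now = time.time() if now is None else now
    payload = json.dumps({"uid": user_id, "exp": now + ttl_seconds}).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return (
        base64.urlsafe_b64encode(payload).decode()
        + "."
        + base64.urlsafe_b64encode(sig).decode()
    )


def verify_token(token: str, now: Optional[float] = None) -> Optional[int]:
    """Return the user id if the token is authentic and unexpired, else None.

    Nothing here touches a database: the signature proves the claims are genuine.
    """
    now = time.time() if now is None else now
    try:
        payload_b64, sig_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        sig = base64.urlsafe_b64decode(sig_b64)
    except ValueError:  # wrong number of parts or invalid base64
        return None
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        return None
    claims = json.loads(payload)
    if claims["exp"] < now:
        return None
    return claims["uid"]
```

The trade-off of this design is revocation: a signed token stays valid until it expires, so short lifetimes carry the weight that a database lookup used to.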

That last detail is worth holding onto, because several May incidents are textbook cases of shared dependencies turning a local problem into a platform-wide one. The whole point of the isolation work is to make those cascades impossible. The May incidents show why the work is necessary and that it is not finished yet.

When a routine migration saturated the database

The May 4 incident is the clearest example of a shared-dependency cascade. For about an hour and six minutes, github.com served elevated latency and an increased rate of 5xx errors. Pull requests went red. Issues, Actions, webhooks, and Git operations degraded, and a long list of dependent services including Codespaces, Pages, Packages, OAuth, Marketplace, and Copilot saw knock-on effects because they share data dependencies. At peak, roughly 1.3% of requests returned a 5xx; the average across the incident was about 0.46%.

The trigger was an online schema migration running against a large, heavily accessed table. It had been progressing cleanly for hours. The problem arrived when normal traffic ramped toward the weekly peak and the combined load from the migration plus production traffic saturated the database's connection capacity. That produced query contention on a primary and cascading timeouts everywhere downstream.

The failure shape is familiar to anyone who has run schema changes against a live system. The migration is not expensive in isolation, and the traffic is not a problem in isolation. The two together cross a threshold that neither would reach alone. The fix set GitHub described is about making that interaction visible and self-limiting: aligning migrations against large tables with low-traffic windows, adding dynamic throttling that adapts to live cluster load, and installing automated circuit breakers that pause an in-flight migration when latency or connection utilization crosses a safe threshold. They are also reviewing connection-pool headroom so migrations have room to run.
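The throttling and circuit-breaker ideas can be sketched in a few lines. This is an illustration under stated assumptions, not GitHub's tooling: the `ClusterSample` type, the threshold constants, and the linear scaling rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClusterSample:
    p99_latency_ms: float
    connection_utilization: float  # fraction of the connection pool in use, 0.0 to 1.0


# Hypothetical thresholds; real values would come from capacity testing.
PAUSE_LATENCY_MS = 250.0
PAUSE_CONN_UTIL = 0.85
FULL_SPEED_CONN_UTIL = 0.50


def migration_batch_size(sample: ClusterSample, max_batch: int = 1000) -> int:
    """Return how many rows the next migration chunk may copy; 0 means pause."""
    # Circuit breaker: stop entirely once either signal crosses its limit.
    if (
        sample.p99_latency_ms >= PAUSE_LATENCY_MS
        or sample.connection_utilization >= PAUSE_CONN_UTIL
    ):
        return 0
    # Plenty of headroom: run at full speed.
    if sample.connection_utilization <= FULL_SPEED_CONN_UTIL:
        return max_batch
    # In between, shrink the batch linearly as the pool fills up.
    span = PAUSE_CONN_UTIL - FULL_SPEED_CONN_UTIL
    headroom = (PAUSE_CONN_UTIL - sample.connection_utilization) / span
    return max(1, int(max_batch * headroom))
```

A migration loop would call this before every chunk with a fresh sample. That way a migration that was safe at 3 a.m. slows down by itself as the weekly peak approaches, instead of relying on someone noticing.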

The pattern repeats. On May 7, follow-up recovery work from a separate pull request incident involved a large migration that caused replication lag on several replica hosts. Those replicas were not serving user traffic, but GitHub's safeguards correctly read the elevated lag as a reason to slow writes to the cluster. That slowdown delayed background pull request processing, which is responsible for emitting the internal events Copilot agents consume to begin work. New Copilot coding agent sessions stopped starting and review agent sessions dropped by about half until replication caught up. A protective mechanism worked exactly as designed and still produced customer impact through a dependency two hops away.

The 32-bit key that ran out

The May 6 review-thread incident is the kind of bug that is obvious in hindsight and nearly invisible before it fires. For nearly four hours, creating new pull request review threads failed at close to a 100% rate. New line comments and file comments on pull requests would not save. Existing comments were fine.

The cause was a 32-bit integer key reaching its maximum value in a Vitess lookup table used during thread creation. The primary table had already been migrated to a 64-bit key. The Vitess lookup table that maps to it had not. Once the primary table's IDs passed the top of the 32-bit space, new thread creation had nowhere to write the lookup entry and failed outright.

This is the integer-overflow failure mode applied to identifier space rather than arithmetic. A signed 32-bit integer tops out at 2,147,483,647, and a system that mints monotonically increasing IDs will eventually reach that ceiling no matter how comfortable it once looked. The migration to 64-bit keys was the right move; the gap was that it covered the primary table and missed a secondary structure that had to agree with it. GitHub mitigated by widening the lookup table column to 64-bit across all shards, and the durable fix is expanding column-size monitoring to include Vitess lookup tables so a column approaching its limit is flagged before it exhausts.
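A column-size check of the kind described is simple arithmetic. The sketch below assumes signed integer keys and uses hypothetical table names and a hypothetical 10% warning threshold; it is not GitHub's monitoring code.

```python
from typing import List, NamedTuple


class KeyColumn(NamedTuple):
    table: str
    bits: int            # width of the signed integer column
    current_max_id: int  # highest ID currently stored


def signed_max(bits: int) -> int:
    """Largest value a signed integer of the given width can hold."""
    return 2 ** (bits - 1) - 1


def remaining_fraction(col: KeyColumn) -> float:
    """Share of the ID space still unused, between 0.0 and 1.0."""
    return 1.0 - col.current_max_id / signed_max(col.bits)


def columns_at_risk(columns: List[KeyColumn], warn_below: float = 0.10) -> List[str]:
    """Names of tables whose key column has less than `warn_below` of its space left."""
    return [c.table for c in columns if remaining_fraction(c) < warn_below]


# Hypothetical inventory: same ID values, different column widths.
inventory = [
    KeyColumn("review_threads", 64, 2_100_000_000),
    KeyColumn("review_threads_lookup", 32, 2_100_000_000),
]
at_risk = columns_at_risk(inventory)
```

With these sample values only the 32-bit lookup table is flagged, which is the situation the incident describes: the check is useful only if the inventory includes every structure that stores the ID, not just the primary table.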

The broader lesson is about coupling. When two tables must hold values from the same space, they have to be migrated as a unit. A 64-bit primary key paired with a 32-bit lookup key is a contradiction that simply waits for traffic to expose it.

Remediation that caused the next outage

The May 5 and May 6 Actions incidents are linked in an uncomfortable way: the remediation for the first introduced the configuration problem that caused the second.

On May 5, hosted runners in East US degraded. A routine scale-up operation for runner VMs hit an internal rate limit while pulling images from storage. The existing backoff logic did not engage, because the response code returned in that case was not one it recognized as a retry signal. About 13.5% of standard runner jobs failed, along with roughly 16% of jobs on larger runners pinned to East US for private networking. Most requests rerouted to other regions automatically, but the portion still landing in East US suffered. Around 8,500 Copilot code review requests timed out, surfacing as error comments that users could retry.

The remediation introduced configuration data that then blocked new allocations as daily load ramped the next morning, producing the May 6 incident in which about 17.1% of standard runner jobs failed. GitHub removed the offending data and allocations resumed. The follow-ups cover both incidents: better throttling behavior when limits are hit, filter logic that tolerates abnormal data shapes, and monitoring that alerts when allocations are blocked rather than waiting for failed jobs to reveal it.

The backoff detail is the instructive part. Backoff logic that keys on specific response codes is only as good as its coverage of the codes it might actually see. An unhandled status that means "slow down" reads as "unknown," and the system charges ahead into the exact wall the backoff existed to avoid.
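One way to close that gap is to invert the default: enumerate the codes that are safe to act on and treat everything else as a reason to slow down. The sketch below is a generic illustration, not the Actions runner code; the `Action` enum and the set of permanent failures are assumptions, and a real client would choose that set per API.

```python
from enum import Enum


class Action(Enum):
    PROCEED = "proceed"
    BACK_OFF = "back_off"
    FAIL = "fail"


# Hypothetical set of codes where retrying cannot help.
PERMANENT_FAILURES = {400, 401, 404}


def classify(status: int) -> Action:
    """Decide what to do with an HTTP-style status code."""
    if 200 <= status < 300:
        return Action.PROCEED
    if status in PERMANENT_FAILURES:
        return Action.FAIL
    # Everything else, including codes nobody anticipated, slows the caller down.
    return Action.BACK_OFF


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 60.0) -> float:
    """Exponential backoff in seconds, capped; `attempt` starts at 0."""
    return min(cap, base * 2 ** attempt)
```

Under this scheme an unfamiliar status costs a short delay instead of an unthrottled retry storm. A production version would normally add random jitter to the delay so that many clients do not retry in lockstep.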

Configuration changes and the account that suspended itself

Two more incidents come down to configuration and automation acting without guardrails.

On May 6, a configuration change to network routing inadvertently removed the ingress path for the Copilot session service. Every request to the session API failed for the duration. The change was reverted within eleven minutes, and the follow-up is improved deployment validation to catch a change that removes a production ingress path before it ships.

The May 15 Actions incident followed a planned infrastructure failover during which an automated service discovery update failed to propagate correctly. Traffic routed to the wrong place, timeouts climbed in a core workflow-orchestration dependency, and 42% of Actions runs failed at peak. Pages and Copilot cloud services, both downstream of Actions, went with it. Responders corrected the routing manually. The fixes are failover guardrails that validate service discovery state before a failover completes, plus stronger verification checks.
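A failover guardrail of this kind reduces to comparing what service discovery should say with what it actually says before declaring the failover done. The following is a minimal sketch with hypothetical service names and endpoints; how the observed state is fetched is left out because the report does not describe it.

```python
from typing import Dict, Optional, Tuple


def stale_discovery_entries(
    expected: Dict[str, str], observed: Dict[str, str]
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each service whose observed endpoint differs from the expected one
    to an (expected, observed) pair. A missing entry shows up as None."""
    return {
        service: (want, observed.get(service))
        for service, want in expected.items()
        if observed.get(service) != want
    }


def may_complete_failover(expected: Dict[str, str], observed: Dict[str, str]) -> bool:
    """Allow the failover to complete only when every entry has propagated."""
    return not stale_discovery_entries(expected, observed)


# Hypothetical post-failover state: one entry still points at the old site.
expected = {"orchestrator": "site-b.internal:8443", "queue": "site-b.internal:5672"}
observed = {"orchestrator": "site-a.internal:8443", "queue": "site-b.internal:5672"}
stale = stale_discovery_entries(expected, observed)
```

Here `stale` names the orchestrator entry and `may_complete_failover` returns False, so the procedure would halt with a specific mismatch to fix instead of discovering the problem through rising timeouts.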

The May 26 incident is the standout. GitHub's automated account-review system incorrectly suspended the service account that GitHub Actions uses to authenticate workflow runs and download actions. From 10:40 to 12:16 UTC, every newly queued run failed to start. Pages, Copilot code review, Copilot coding agent, Octoshift, and GitHub Enterprise Importer all failed along with it. A side effect of disabling the account was that a small number of issues, pull requests, comments, and discussions were marked hidden. No data was lost, and all hidden content was later restored along with the search index.

The fix is an allowlist of service accounts that automated systems cannot suspend, enforced consistently across account-management tooling. An automated safety system did its job too literally and turned its enforcement against critical infrastructure. The guardrail is to teach the automation which accounts are load-bearing.
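In code, the guardrail is a check that sits in front of the suspension path and is shared by every tool that can reach it. The sketch below assumes a single `suspend_account` entry point and a hypothetical account name; the real list and its enforcement points are GitHub's and are not described in the report.

```python
from typing import Set

# Hypothetical allowlist; the real one would be managed configuration.
PROTECTED_SERVICE_ACCOUNTS = frozenset({"actions-service"})


class ProtectedAccountError(Exception):
    """Raised when automation tries to suspend a load-bearing account."""


def suspend_account(login: str, automated: bool, suspended: Set[str]) -> None:
    """Suspend `login`, unless an automated caller targets a protected account."""
    if automated and login in PROTECTED_SERVICE_ACCOUNTS:
        raise ProtectedAccountError(
            f"{login} is a protected service account; escalate to a human"
        )
    suspended.add(login)
```

Raising an error instead of silently skipping the account is a deliberate choice in this sketch: if the review system keeps flagging a protected account, someone should hear about it.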

An upstream model provider

The May 28 incident sits slightly apart. The Copilot service degraded because an upstream provider's Responses API returned elevated errors for the GPT-5.2, GPT-5.3-Codex, GPT-5.4, and GPT-5.5 models. Copilot coding agent and code review were affected; other models were not. GitHub shifted traffic away from the affected models while the provider deployed a fix, and is working on automated failover for those models.
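Automated failover between models can be approximated with a per-model health window and an ordered preference list. This is a sketch under assumptions: the `ModelHealth` class, the window size, the error-rate threshold, and the placeholder model names are all hypothetical, and GitHub has not published how its failover will work.

```python
from collections import defaultdict, deque
from typing import List, Optional


class ModelHealth:
    """Tracks a sliding window of recent call outcomes per model."""

    def __init__(self, window: int = 20, max_error_rate: float = 0.5) -> None:
        self.max_error_rate = max_error_rate
        self._outcomes = defaultdict(lambda: deque(maxlen=window))

    def record(self, model: str, ok: bool) -> None:
        """Store one call result; the oldest result falls out when the window is full."""
        self._outcomes[model].append(ok)

    def is_healthy(self, model: str) -> bool:
        outcomes = self._outcomes.get(model)
        if not outcomes:
            return True  # no data yet, so give the model the benefit of the doubt
        errors = sum(1 for ok in outcomes if not ok)
        return errors / len(outcomes) <= self.max_error_rate


def pick_model(preference_order: List[str], health: ModelHealth) -> Optional[str]:
    """Return the first healthy model in preference order, or None if all are failing."""
    for model in preference_order:
        if health.is_healthy(model):
            return model
    return None
```

One limitation worth noting: once traffic moves away from a failing model, its window stops receiving new results, so a real system would also need periodic probe requests to detect recovery.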

This is the dependency story pointed outward. As more of the platform's surface area depends on model providers, the availability of those providers becomes part of GitHub's own availability, and automated failover between models becomes an infrastructure concern rather than a product nicety.

The throughline

Nearly every incident in this report is a coupling failure. A migration coupled to peak traffic. A primary key coupled to a lookup key. A remediation coupled to the next day's allocation path. A service account coupled to an automated review system that did not know it was special. Replication lag on idle replicas coupled to the events that start Copilot agents.

That is exactly the class of problem the larger architectural work is meant to retire. Splitting users, authentication, and authorization into isolated domains, removing per-request database lookups with stateless tokens, and moving to elastic Azure capacity all attack the same underlying issue: too many things sharing a single failure point. The May incidents are a snapshot of a system mid-transition, where the old coupling still exists in places and the new isolation has not yet reached everything.

GitHub points readers to its status page for real-time updates and the engineering section of the GitHub Blog for deeper writeups. The honest version of a reliability report is one that names the mechanisms, and this one does. Reading the root causes back to back is a better education in distributed-systems failure than most postmortems offer one at a time.
