GitHub's December 2025 availability report reveals five incidents affecting AI Controls, Copilot Code Review, Actions runners, policy management, and unauthenticated traffic. The outages, caused by configuration errors, model latency, network issues, schema drift, and traffic spikes, highlight the complexity of modern cloud platforms and the need for better monitoring, validation, and resilience patterns.
GitHub's December 2025 availability report, published by Jakub Oleksy, details five incidents that degraded performance across multiple services. The outages spanned AI-powered features, CI/CD infrastructure, and core platform components, revealing the cascading effects of configuration errors, model dependencies, network issues, and traffic spikes in a large-scale cloud platform.

Incident 1: AI Controls Data Pipeline Failure
Date: December 8, 19:51 UTC (1h 15m)
Enterprise administrators lost the ability to view agent session activities in the AI Controls page for nearly two weeks—from November 26 to December 8. The issue didn't affect audit logs or direct navigation to individual session logs, but it blocked the centralized view used for managing AI agents.
Root Cause: A misconfiguration in a November 25 deployment prevented data from being published to an internal Kafka topic that feeds the AI Controls page. Kafka acts as a message broker, streaming event data between services. When the publisher configuration broke, the downstream consumer for the AI Controls UI had no data to display.
Mitigation: The team corrected the configuration on December 8, restoring data flow.
Prevention: GitHub is improving monitoring for data pipeline dependencies and adding pre-deployment validation to catch configuration errors before production.
This incident illustrates a common pattern in event-driven architectures: when a single Kafka topic fails, it can silently break a UI feature without affecting the underlying data storage. The audit logs remained intact because they likely consume from a different topic or use a separate ingestion path. For enterprise users, this meant compliance data was available, but operational visibility was lost.
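GitHub has not shared the pipeline code, but the failure mode is easy to reproduce in miniature: a publisher whose delivery outcomes are never observed fails silently. The sketch below, which uses the confluent-kafka Python client with hypothetical topic and broker names, shows the kind of delivery callback that turns a silent drop into an alertable signal.

```python
# Minimal sketch (not GitHub's code): a Kafka publisher whose failures are
# surfaced instead of silently dropped. Topic and broker names are hypothetical.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "kafka-broker:9092"})

def on_delivery(err, msg):
    # Without a callback like this, a misconfigured publisher fails silently:
    # the service stays "healthy" while the downstream UI receives no data.
    if err is not None:
        # Emit a metric or page on-call here; this is the signal that was missing.
        print(f"publish to {msg.topic()} failed: {err}")

event = {"session_id": "abc123", "agent": "copilot", "action": "started"}
producer.produce("ai-controls-sessions", json.dumps(event).encode(), callback=on_delivery)
producer.poll(0)      # serve delivery callbacks
producer.flush(10)    # wait for outstanding messages before shutting down
```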
Incident 2: Copilot Code Review Model Latency
Date: December 15, 17:43 UTC (39 minutes)
Copilot Code Review failed on 46.97% of pull request review requests during the incident window. Users saw an error message prompting them to re-request a review. The remaining requests completed successfully.
Root Cause: An internal, model-backed dependency experienced elevated response times. This triggered request timeouts and backpressure in the review processing pipeline, causing queue growth and failed completions.
Mitigation: The team took three actions:
- Temporarily bypassed fix suggestions to reduce latency
- Increased worker capacity to drain the backlog
- Deployed a model configuration change to reduce end-to-end latency
Post-incident: GitHub increased baseline worker capacity, added instrumentation for worker utilization and queue health, and is improving automatic load-shedding, fallback behavior, and alerting.
This incident shows the challenge of integrating AI models into production pipelines. Model response times vary with input complexity, model state, and infrastructure load, so a model-backed dependency can degrade without warning. Here, the resulting timeouts and backpressure did not contain the slowdown; they translated it into queue growth and failed reviews. By temporarily bypassing fix suggestions, the team traded functionality for availability, a classic reliability trade-off.
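A minimal sketch of that trade-off, with invented helper functions standing in for the review and fix-suggestion steps (this is not Copilot's implementation): the model-backed step runs under a deadline, so a slow model degrades the optional feature rather than failing the review.

```python
# Sketch only: run the model-backed "fix suggestions" step under a deadline so a
# latency spike degrades the feature, not the whole review. Helpers are hypothetical.
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

MODEL_DEADLINE_SECONDS = 1.0  # assumed budget, not a published GitHub value

def generate_review_comments(diff: str) -> list[str]:
    return [f"reviewed {len(diff)} bytes"]   # placeholder for the core review step

def generate_fix_suggestions(diff: str) -> list[str]:
    time.sleep(3.0)                          # simulate a slow model dependency
    return ["suggested fix"]

def review_pull_request(diff: str, executor: ThreadPoolExecutor) -> dict:
    comments = generate_review_comments(diff)                 # must always complete
    future = executor.submit(generate_fix_suggestions, diff)  # optional, model-backed
    try:
        suggestions = future.result(timeout=MODEL_DEADLINE_SECONDS)
    except TimeoutError:
        suggestions = []   # degrade the optional feature instead of failing the review
    return {"comments": comments, "fix_suggestions": suggestions}

with ThreadPoolExecutor(max_workers=4) as pool:
    print(review_pull_request("diff --git a/x b/x", pool))
```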
Incident 3: Actions Runner Network Packet Loss
Date: December 18, 16:33 UTC (1h 8m)
GitHub Actions runners in the West US region experienced intermittent timeouts for GitHub API calls, causing failures during runner setup and workflow execution.
Root Cause: Network packet loss between runners and one of GitHub's edge sites. Approximately 1.5% of jobs on larger and standard hosted runners in West US were impacted (0.28% of all Actions jobs).
Mitigation: By 17:11 UTC, all traffic was routed away from the affected edge site.
Prevention: GitHub is working on improved early detection of cross-cloud connectivity issues and faster mitigation paths.
This is a classic edge routing problem. Actions runners need to authenticate and fetch job definitions from GitHub's API during startup, so packet loss at the edge turns into timeouts that cascade into workflow failures. The low overall impact (0.28% of all jobs) reflects the fact that only traffic routed through the affected edge site was hit; for users whose jobs took that path, though, CI/CD was effectively blocked.
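For teams on the consuming side, the practical defense is bounded timeouts plus retries with backoff on API calls. A generic sketch using the requests library (not GitHub's runner code):

```python
# Sketch only: a retry-with-backoff wrapper of the kind a workflow script can use
# for GitHub API calls that may hit transient packet loss or timeouts.
import time
import requests

def get_with_retries(url: str, attempts: int = 4, timeout: float = 10.0) -> requests.Response:
    for attempt in range(attempts):
        try:
            resp = requests.get(url, timeout=timeout)   # bound the wait instead of hanging
            resp.raise_for_status()
            return resp
        except (requests.Timeout, requests.ConnectionError):
            if attempt == attempts - 1:
                raise                                   # give up after the last attempt
            time.sleep(2 ** attempt)                    # back off: 1s, 2s, 4s, ...

print(get_with_retries("https://api.github.com/meta").status_code)
```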
Incident 4: Copilot Policy Management Schema Drift
Date: December 18, 17:36 UTC (1h 33m)
Users, organizations, and enterprises could not update any Copilot policies during this window. No other GitHub services were affected.
Root Cause: A database migration caused schema drift. Schema drift occurs when the expected database structure (tables, columns, constraints) doesn't match the actual structure, often due to incomplete or failed migrations.
Mitigation: The team synchronized the schema to resolve the drift.
Prevention: GitHub hardened the service to prevent future schema drift and is investigating deployment pipeline improvements to reduce mitigation time.
Schema drift is a critical failure mode in database-driven services. The policy management service likely had multiple instances running with different schema versions, or a migration partially failed. This prevented writes to the policy tables. The fact that other Copilot services were unaffected suggests they use separate databases or tables, isolating the blast radius.
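A lightweight guard against this failure mode is a pre-deployment check that compares the schema the code expects with what the database actually has. The sketch below uses SQLite and an invented policy table purely for illustration:

```python
# Illustrative only: a pre-deployment check that compares the columns the code
# expects against what the database actually has. Table and columns are made up.
import sqlite3

EXPECTED = {
    "copilot_policies": {"id", "scope", "policy_key", "policy_value", "updated_at"},
}

def find_schema_drift(conn: sqlite3.Connection) -> dict[str, set[str]]:
    drift = {}
    for table, expected_cols in EXPECTED.items():
        rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
        actual_cols = {row[1] for row in rows}   # second field is the column name
        missing = expected_cols - actual_cols
        if missing:
            drift[table] = missing               # non-empty result should block the deploy
    return drift

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE copilot_policies (id INTEGER, scope TEXT, policy_key TEXT)")
print(find_schema_drift(conn))   # reports the two missing columns
```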
Incident 5: Unauthenticated Traffic Spike
Date: December 22, 22:31 UTC (1h 46m)
Unauthenticated requests to github.com were degraded, causing slow or timed-out page loads and API requests. This affected Actions jobs making unauthenticated requests, such as release downloads. Authenticated traffic was not impacted.
Root Cause: A severe spike in traffic, primarily to search endpoints.
Mitigation: The team identified and mitigated the traffic source, and automated traffic management restored service.
Prevention: GitHub improved limiters for load to relevant endpoints and is continuing work to proactively identify large traffic changes, improve resilience in critical request flows, and reduce time to mitigation.
This incident resembles a DDoS attack or a misconfigured crawler hitting search endpoints aggressively. Unauthenticated traffic is easier to abuse because it doesn't require API tokens or session management. The search endpoints are particularly vulnerable because they're computationally expensive—searching large codebases requires significant resources. GitHub's rate limiters and automated traffic management eventually caught up, but not before affecting legitimate users.
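The standard defense is per-client rate limiting on expensive unauthenticated endpoints. A toy token-bucket limiter (limits and keying are illustrative, not GitHub's values) looks like this:

```python
# A toy token-bucket limiter of the general kind used to protect expensive,
# unauthenticated endpoints such as search. Rates and keys are invented.
import time
from collections import defaultdict

RATE = 10.0    # tokens refilled per second per client
BURST = 30.0   # maximum bucket size

_buckets: dict[str, tuple[float, float]] = defaultdict(lambda: (BURST, time.monotonic()))

def allow_request(client_key: str) -> bool:
    tokens, last = _buckets[client_key]
    now = time.monotonic()
    tokens = min(BURST, tokens + (now - last) * RATE)   # refill since last request
    if tokens < 1.0:
        _buckets[client_key] = (tokens, now)
        return False                                    # shed load: respond with 429
    _buckets[client_key] = (tokens - 1.0, now)
    return True

# Unauthenticated callers are keyed by source IP here (a simplification).
print([allow_request("203.0.113.7") for _ in range(35)].count(True))   # ~30 allowed
```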
Patterns and Architectural Lessons
Event-Driven Complexity
The AI Controls incident shows how Kafka-based pipelines create hidden dependencies. A configuration error in one publisher can break a UI without affecting data storage. Better monitoring of data flow (not just system health) would have caught this sooner.
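Monitoring data flow can be as simple as tracking event freshness on the consumer side and alerting when it goes stale, sketched below with invented thresholds:

```python
# One way to monitor data flow rather than process health: record when the consumer
# last saw an event and flag staleness. Threshold and names are invented.
import time

STALENESS_THRESHOLD_SECONDS = 15 * 60   # tolerate 15 minutes without events

class FlowMonitor:
    def __init__(self) -> None:
        self.last_event_at = time.monotonic()

    def record_event(self) -> None:
        self.last_event_at = time.monotonic()   # call this from the consumer loop

    def is_stale(self) -> bool:
        # A green process with a stale flow is exactly the Incident 1 failure mode:
        # everything reports healthy while the UI quietly receives nothing.
        return time.monotonic() - self.last_event_at > STALENESS_THRESHOLD_SECONDS

monitor = FlowMonitor()
print(monitor.is_stale())   # expose this as an alertable metric, not just a log line
```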
AI Model Dependencies
The Copilot Code Review incident demonstrates that AI models are unpredictable dependencies. Production systems need:
- Fallback mechanisms (such as temporarily bypassing non-critical features)
- Queue monitoring and alerting
- Automatic load-shedding (a minimal sketch follows this list)
- Capacity planning for model inference
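As flagged in the list above, load shedding can be as simple as rejecting new work once the queue is already deeper than the workers can drain; a minimal sketch with an assumed threshold:

```python
# Minimal load-shedding sketch (not GitHub's): reject new work when the queue is
# already deep, so a latency spike doesn't turn into an unbounded backlog.
import queue

MAX_QUEUE_DEPTH = 500   # assumed threshold, tuned to worker capacity in practice

review_queue: "queue.Queue[str]" = queue.Queue()

def enqueue_review(request_id: str) -> bool:
    if review_queue.qsize() >= MAX_QUEUE_DEPTH:
        # Fail fast with a retryable error ("please re-request a review") instead
        # of accepting work the workers cannot drain.
        return False
    review_queue.put(request_id)
    return True

print(enqueue_review("pr-123"))   # True while there is headroom
```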
Edge Routing and Connectivity
The Actions runner incident highlights the fragility of cross-cloud networking. Packet loss at a single edge site can break CI/CD for a region. Faster detection and routing changes are critical.
Database Schema Management
Schema drift in the policy service shows that migrations need:
- Pre-deployment validation
- Runtime schema synchronization
- Hardened deployment pipelines
Traffic Management
The search endpoint spike shows that unauthenticated endpoints need aggressive rate limiting and anomaly detection. Proactive monitoring of traffic patterns is essential.
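Proactive traffic monitoring can start with something as simple as comparing current request rates to a rolling baseline; the sketch below uses invented numbers:

```python
# Sketch of proactive spike detection: compare the current request rate for an
# endpoint against a rolling baseline and alert on a large multiple. Values invented.
from collections import deque

class SpikeDetector:
    def __init__(self, window: int = 60, multiplier: float = 5.0) -> None:
        self.history: deque[float] = deque(maxlen=window)   # recent per-minute counts
        self.multiplier = multiplier

    def observe(self, requests_per_minute: float) -> bool:
        baseline = sum(self.history) / len(self.history) if self.history else None
        self.history.append(requests_per_minute)
        # Alert when traffic exceeds several times the recent baseline.
        return baseline is not None and requests_per_minute > self.multiplier * baseline

detector = SpikeDetector()
for rpm in [1000, 1100, 950, 1050, 9000]:      # the last sample is the spike
    if detector.observe(rpm):
        print(f"traffic spike: {rpm} rpm")     # page on-call / tighten limiters
```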

What This Means for Platform Teams
These incidents reveal that even mature platforms like GitHub face cascading failures from configuration errors, model latency, network issues, schema drift, and traffic spikes. For teams building on GitHub, the lessons are:
- Assume partial failures: Design workflows to handle intermittent API failures
- Monitor dependencies: Track not just service health, but data flow and queue depth
- Plan for AI unpredictability: Build fallbacks for AI-powered features
- Validate configurations: Pre-deployment checks can prevent production issues
- Watch traffic patterns: Unusual traffic can overwhelm endpoints
GitHub's post-incident improvements—better monitoring, validation, load-shedding, and faster mitigation—reflect a shift toward proactive resilience rather than reactive fixes.
For the full details, see the GitHub Availability Report on the GitHub Blog.
