GitHub's December 2025 availability report reveals five incidents affecting AI Controls, Copilot Code Review, Actions runners, policy management, and unauthenticated traffic. The outages, caused by configuration errors, model latency, network issues, schema drift, and traffic spikes, highlight the complexity of modern cloud platforms and the need for better monitoring, validation, and resilience patterns.
GitHub's December 2025 availability report, published by Jakub Oleksy, details five incidents that degraded performance across multiple services. The outages spanned AI-powered features, CI/CD infrastructure, and core platform components, revealing the cascading effects of configuration errors, model dependencies, network issues, and traffic spikes in a large-scale cloud platform.

Incident 1: AI Controls Data Pipeline Failure
Date: December 8, 19:51 UTC (1h 15m)
Enterprise administrators lost the ability to view agent session activities in the AI Controls page for nearly two weeks—from November 26 to December 8. The issue didn't affect audit logs or direct navigation to individual session logs, but it blocked the centralized view used for managing AI agents.
Root Cause: A misconfiguration in a November 25 deployment prevented data from being published to an internal Kafka topic that feeds the AI Controls page. Kafka acts as a message broker, streaming event data between services. When the publisher configuration broke, the downstream consumer for the AI Controls UI had no data to display.
Mitigation: The team corrected the configuration on December 8, restoring data flow.
Prevention: GitHub is improving monitoring for data pipeline dependencies and adding pre-deployment validation to catch configuration errors before production.
This incident illustrates a common pattern in event-driven architectures: when a single Kafka topic fails, it can silently break a UI feature without affecting the underlying data storage. The audit logs remained intact because they likely consume from a different topic or use a separate ingestion path. For enterprise users, this meant compliance data was available, but operational visibility was lost.
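GitHub has not shared the pipeline code, but the failure mode is easy to reproduce in miniature: a publisher whose delivery outcomes are never observed fails silently. The sketch below, which uses the confluent-kafka Python client with hypothetical topic and broker names, shows the kind of delivery callback that turns a silent drop into an alertable signal.

```python
# Minimal sketch (not GitHub's code): a Kafka publisher whose failures are
# surfaced instead of silently dropped. Topic and broker names are hypothetical.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "kafka-broker:9092"})

def on_delivery(err, msg):
    # Without a callback like this, a misconfigured publisher fails silently:
    # the service stays "healthy" while the downstream UI receives no data.
    if err is not None:
        # Emit a metric or page on-call here; this is the signal that was missing.
        print(f"publish to {msg.topic()} failed: {err}")

event = {"session_id": "abc123", "agent": "copilot", "action": "started"}
producer.produce("ai-controls-sessions", json.dumps(event).encode(), callback=on_delivery)
producer.poll(0)      # serve delivery callbacks
producer.flush(10)    # wait for outstanding messages before shutting down
```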
Incident 2: Copilot Code Review Model Latency
Date: December 15, 17:43 UTC (39 minutes)
Copilot Code Review failed on 46.97% of pull request review requests during the incident window. Users saw an error message prompting them to re-request a review. The remaining requests completed successfully.
Root Cause: An internal, model-backed dependency experienced elevated response times. This triggered request timeouts and backpressure in the review processing pipeline, causing queue growth and failed completions.
Mitigation: The team took three actions:
- Temporarily bypassed fix suggestions to reduce latency
- Increased worker capacity to drain the backlog
- Deployed a model configuration change to reduce end-to-end latency
Post-incident: GitHub increased baseline worker capacity, added instrumentation for worker utilization and queue health, and is improving automatic load-shedding, fallback behavior, and alerting.
This incident shows the challenge of integrating AI models into production pipelines. Model response times vary with input complexity, model state, and infrastructure load, so a model-backed dependency can degrade without warning. Here, the resulting timeouts and backpressure did not contain the slowdown; they translated it into queue growth and failed reviews. By temporarily bypassing fix suggestions, the team traded functionality for availability, a classic reliability trade-off.
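A minimal sketch of that trade-off, with invented helper functions standing in for the review and fix-suggestion steps (this is not Copilot's implementation): the model-backed step runs under a deadline, so a slow model degrades the optional feature rather than failing the review.

```python
# Sketch only: run the model-backed "fix suggestions" step under a deadline so a
# latency spike degrades the feature, not the whole review. Helpers are hypothetical.
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

MODEL_DEADLINE_SECONDS = 1.0  # assumed budget, not a published GitHub value

def generate_review_comments(diff: str) -> list[str]:
    return [f"reviewed {len(diff)} bytes"]   # placeholder for the core review step

def generate_fix_suggestions(diff: str) -> list[str]:
    time.sleep(3.0)                          # simulate a slow model dependency
    return ["suggested fix"]

def review_pull_request(diff: str, executor: ThreadPoolExecutor) -> dict:
    comments = generate_review_comments(diff)                 # must always complete
    future = executor.submit(generate_fix_suggestions, diff)  # optional, model-backed
    try:
        suggestions = future.result(timeout=MODEL_DEADLINE_SECONDS)
    except TimeoutError:
        suggestions = []   # degrade the optional feature instead of failing the review
    return {"comments": comments, "fix_suggestions": suggestions}

with ThreadPoolExecutor(max_workers=4) as pool:
    print(review_pull_request("diff --git a/x b/x", pool))
```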
Incident 3: Actions Runner Network Packet Loss
Date: December 18, 16:33 UTC (1h 8m)
GitHub Actions runners in the West US region experienced intermittent timeouts for GitHub API calls, causing failures during runner setup and workflow execution.
Root Cause: Network packet loss between runners and one of GitHub's edge sites. Approximately 1.5% of jobs on larger and standard hosted runners in West US were impacted (0.28% of all Actions jobs).
Mitigation: By 17:11 UTC, all traffic was routed away from the affected edge site.
Prevention: GitHub is working on improved early detection of cross-cloud connectivity issues and faster mitigation paths.
This is a classic edge routing problem. Actions runners need to authenticate and fetch job definitions from GitHub's API during startup, so packet loss at the edge turns into timeouts that cascade into workflow failures. The low overall impact (0.28% of all jobs) reflects the fact that only traffic routed through the affected edge site was hit; for users whose jobs took that path, though, CI/CD was effectively blocked.
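For teams on the consuming side, the practical defense is bounded timeouts plus retries with backoff on API calls. A generic sketch using the requests library (not GitHub's runner code):

```python
# Sketch only: a retry-with-backoff wrapper of the kind a workflow script can use
# for GitHub API calls that may hit transient packet loss or timeouts.
import time
import requests

def get_with_retries(url: str, attempts: int = 4, timeout: float = 10.0) -> requests.Response:
    for attempt in range(attempts):
        try:
            resp = requests.get(url, timeout=timeout)   # bound the wait instead of hanging
            resp.raise_for_status()
            return resp
        except (requests.Timeout, requests.ConnectionError):
            if attempt == attempts - 1:
                raise                                   # give up after the last attempt
            time.sleep(2 ** attempt)                    # back off: 1s, 2s, 4s, ...

print(get_with_retries("https://api.github.com/meta").status_code)
```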
Incident 4: Copilot Policy Management Schema Drift
Date: December 18, 17:36 UTC (1h 33m)
Users, organizations, and enterprises could not update any Copilot policies during this window. No other GitHub services were affected.
Root Cause: A database migration caused schema drift. Schema drift occurs when the expected database structure (tables, columns, constraints) doesn't match the actual structure, often due to incomplete or failed migrations.
Mitigation: The team synchronized the schema to resolve the drift.
Prevention: GitHub hardened the service to prevent future schema drift and is investigating deployment pipeline improvements to reduce mitigation time.
Schema drift is a critical failure mode in database-driven services. The policy management service likely had multiple instances running with different schema versions, or a migration partially failed. This prevented writes to the policy tables. The fact that other Copilot services were unaffected suggests they use separate databases or tables, isolating the blast radius.
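A lightweight guard against this failure mode is a pre-deployment check that compares the schema the code expects with what the database actually has. The sketch below uses SQLite and an invented policy table purely for illustration:

```python
# Illustrative only: a pre-deployment check that compares the columns the code
# expects against what the database actually has. Table and columns are made up.
import sqlite3

EXPECTED = {
    "copilot_policies": {"id", "scope", "policy_key", "policy_value", "updated_at"},
}

def find_schema_drift(conn: sqlite3.Connection) -> dict[str, set[str]]:
    drift = {}
    for table, expected_cols in EXPECTED.items():
        rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
        actual_cols = {row[1] for row in rows}   # second field is the column name
        missing = expected_cols - actual_cols
        if missing:
            drift[table] = missing               # non-empty result should block the deploy
    return drift

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE copilot_policies (id INTEGER, scope TEXT, policy_key TEXT)")
print(find_schema_drift(conn))   # reports the two missing columns
```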
Incident 5: Unauthenticated Traffic Spike
Date: December 22, 22:31 UTC (1h 46m)
Unauthenticated requests to github.com were degraded, causing slow or timed-out page loads and API requests. This affected Actions jobs making unauthenticated requests, such as release downloads. Authenticated traffic was not impacted.
Root Cause: A severe spike in traffic, primarily to search endpoints.
Mitigation: The team identified and mitigated the traffic source, and automated traffic management restored service.
Prevention: GitHub improved limiters for load to relevant endpoints and is continuing work to proactively identify large traffic changes, improve resilience in critical request flows, and reduce time to mitigation.
This incident resembles a DDoS attack or a misconfigured crawler hitting search endpoints aggressively. Unauthenticated traffic is easier to abuse because it doesn't require API tokens or session management. The search endpoints are particularly vulnerable because they're computationally expensive—searching large codebases requires significant resources. GitHub's rate limiters and automated traffic management eventually caught up, but not before affecting legitimate users.
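The standard defense is per-client rate limiting on expensive unauthenticated endpoints. A toy token-bucket limiter (limits and keying are illustrative, not GitHub's values) looks like this:

```python
# A toy token-bucket limiter of the general kind used to protect expensive,
# unauthenticated endpoints such as search. Rates and keys are invented.
import time
from collections import defaultdict

RATE = 10.0    # tokens refilled per second per client
BURST = 30.0   # maximum bucket size

_buckets: dict[str, tuple[float, float]] = defaultdict(lambda: (BURST, time.monotonic()))

def allow_request(client_key: str) -> bool:
    tokens, last = _buckets[client_key]
    now = time.monotonic()
    tokens = min(BURST, tokens + (now - last) * RATE)   # refill since last request
    if tokens < 1.0:
        _buckets[client_key] = (tokens, now)
        return False                                    # shed load: respond with 429
    _buckets[client_key] = (tokens - 1.0, now)
    return True

# Unauthenticated callers are keyed by source IP here (a simplification).
print([allow_request("203.0.113.7") for _ in range(35)].count(True))   # ~30 allowed
```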
Patterns and Architectural Lessons
Event-Driven Complexity
The AI Controls incident shows how Kafka-based pipelines create hidden dependencies. A configuration error in one publisher can break a UI without affecting data storage. Better monitoring of data flow (not just system health) would have caught this sooner.
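Monitoring data flow can be as simple as tracking event freshness on the consumer side and alerting when it goes stale, sketched below with invented thresholds:

```python
# One way to monitor data flow rather than process health: record when the consumer
# last saw an event and flag staleness. Threshold and names are invented.
import time

STALENESS_THRESHOLD_SECONDS = 15 * 60   # tolerate 15 minutes without events

class FlowMonitor:
    def __init__(self) -> None:
        self.last_event_at = time.monotonic()

    def record_event(self) -> None:
        self.last_event_at = time.monotonic()   # call this from the consumer loop

    def is_stale(self) -> bool:
        # A green process with a stale flow is exactly the Incident 1 failure mode:
        # everything reports healthy while the UI quietly receives nothing.
        return time.monotonic() - self.last_event_at > STALENESS_THRESHOLD_SECONDS

monitor = FlowMonitor()
print(monitor.is_stale())   # expose this as an alertable metric, not just a log line
```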
AI Model Dependencies
The Copilot Code Review incident demonstrates that AI models are unpredictable dependencies. Production systems need:
- Fallback mechanisms (such as temporarily bypassing non-critical features)
- Queue monitoring and alerting
- Automatic load-shedding (a minimal sketch follows this list)
- Capacity planning for model inference
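As flagged in the list above, load shedding can be as simple as rejecting new work once the queue is already deeper than the workers can drain; a minimal sketch with an assumed threshold:

```python
# Minimal load-shedding sketch (not GitHub's): reject new work when the queue is
# already deep, so a latency spike doesn't turn into an unbounded backlog.
import queue

MAX_QUEUE_DEPTH = 500   # assumed threshold, tuned to worker capacity in practice

review_queue: "queue.Queue[str]" = queue.Queue()

def enqueue_review(request_id: str) -> bool:
    if review_queue.qsize() >= MAX_QUEUE_DEPTH:
        # Fail fast with a retryable error ("please re-request a review") instead
        # of accepting work the workers cannot drain.
        return False
    review_queue.put(request_id)
    return True

print(enqueue_review("pr-123"))   # True while there is headroom
```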
Edge Routing and Connectivity
The Actions runner incident highlights the fragility of cross-cloud networking. Packet loss at a single edge site can break CI/CD for a region. Faster detection and routing changes are critical.
Database Schema Management
Schema drift in the policy service shows that migrations need:
- Pre-deployment validation
- Runtime schema synchronization
- Hardened deployment pipelines
Traffic Management
The search endpoint spike shows that unauthenticated endpoints need aggressive rate limiting and anomaly detection. Proactive monitoring of traffic patterns is essential.
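Proactive traffic monitoring can start with something as simple as comparing current request rates to a rolling baseline; the sketch below uses invented numbers:

```python
# Sketch of proactive spike detection: compare the current request rate for an
# endpoint against a rolling baseline and alert on a large multiple. Values invented.
from collections import deque

class SpikeDetector:
    def __init__(self, window: int = 60, multiplier: float = 5.0) -> None:
        self.history: deque[float] = deque(maxlen=window)   # recent per-minute counts
        self.multiplier = multiplier

    def observe(self, requests_per_minute: float) -> bool:
        baseline = sum(self.history) / len(self.history) if self.history else None
        self.history.append(requests_per_minute)
        # Alert when traffic exceeds several times the recent baseline.
        return baseline is not None and requests_per_minute > self.multiplier * baseline

detector = SpikeDetector()
for rpm in [1000, 1100, 950, 1050, 9000]:      # the last sample is the spike
    if detector.observe(rpm):
        print(f"traffic spike: {rpm} rpm")     # page on-call / tighten limiters
```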

What This Means for Platform Teams
These incidents reveal that even mature platforms like GitHub face cascading failures from configuration errors, model latency, network issues, schema drift, and traffic spikes. For teams building on GitHub, the lessons are:
- Assume partial failures: Design workflows to handle intermittent API failures
- Monitor dependencies: Track not just service health, but data flow and queue depth
- Plan for AI unpredictability: Build fallbacks for AI-powered features
- Validate configurations: Pre-deployment checks can prevent production issues
- Watch traffic patterns: Unusual traffic can overwhelm endpoints
GitHub's post-incident improvements—better monitoring, validation, load-shedding, and faster mitigation—reflect a shift toward proactive resilience rather than reactive fixes.
For the full details, see the GitHub Availability Report on the GitHub Blog.
