GitHub developed an eBPF-based system to prevent deployment scripts from creating circular dependencies that could hinder incident recovery. By selectively blocking problematic network calls at the kernel level and tracking offending processes, they improved deployment safety without impacting production traffic.
GitHub’s internal reliance on github.com creates a fundamental challenge: deploying fixes requires access to the very platform that might be down during an outage. While maintaining code mirrors and built assets addresses the surface issue, deeper circular dependencies lurk in deployment scripts themselves—dependencies that can silently sabotage recovery efforts. Lawrence Gripper and Aleksey Levenstein detail how GitHub’s platform engineering team used eBPF to solve this problem, transforming a manual, error-prone process into an automated safety net.
The Circular Dependency Trap
Consider a MySQL outage impacting GitHub’s ability to serve release data. To recover, engineers must deploy a configuration change via a deploy script on affected nodes. Three dependency types can derail this process:
- Direct dependency: The script tries to download a tool release from github.com (now inaccessible due to the outage).
- Hidden dependency: A local tool checks github.com for updates during execution, causing hangs or failures depending on how it handles errors.
- Transitive dependency: The script calls an internal service (e.g., a migration tool) that itself attempts to fetch a binary from github.com, propagating the failure back.
Traditional mitigation—blocking all github.com access on deployment hosts—isn’t feasible. These stateful hosts handle live customer traffic during rolling deploys; cutting off github.com would disrupt production requests. The team needed granular control: block only the specific calls creating circular dependencies while allowing legitimate traffic.
eBPF as a Scalpel, Not a Hammer
GitHub turned to eBPF (extended Berkeley Packet Filter) for its ability to run sandboxed programs in the Linux kernel, hooking into system events like network packets or syscalls. They focused on two program types:
- BPF_PROG_TYPE_CGROUP_SKB: Filters network egress traffic for processes in a specific control group (cgroup).
- BPF_PROG_TYPE_CGROUP_SOCK_ADDR: Modifies socket connection attempts (e.g., rewriting DNS queries).
Their solution isolates deployment scripts in a dedicated cgroup. Within this cgroup:
- DNS queries are intercepted and redirected to a userspace DNS proxy.
- The proxy checks requested domains against a blocklist (e.g., github.com) and communicates verdicts back to the kernel via eBPF Maps.
- If a domain is blocked, the network packet is dropped; otherwise, traffic flows normally.
This approach avoids maintaining static IP blocklists—critical given GitHub’s dynamic infrastructure. Instead, the DNS proxy handles domain-to-IP resolution dynamically, applying policy at the moment of lookup.

Figure: Direct circular dependency scenario where a deploy script fails to fetch a tool release from github.com during an outage.
Beyond Blocking: Debugging and Accountability
Blocking problematic calls is only half the battle. Teams need to know why a block occurred to fix the underlying dependency. GitHub enhanced their eBPF program to capture forensic data:
- When a DNS query is blocked, the kernel program records the DNS transaction ID and the process ID (PID) that initiated the request.
- This data is stored in an eBPF Map (DNS Transaction ID → PID).
- The userspace DNS proxy reads the map, looks up the PID in /proc/{PID}/cmdline, and retrieves the full command line that triggered the request.
- A structured log emerges:
WARN DNS BLOCKED reason=FromDNSRequest blocked=true blockedAt=dns domain=github.com pid=266767 cmd="curl github.com " firewallMethod=blocklist
This transforms opaque failures into actionable insights. Teams immediately see which command (e.g., a curl in a deploy script) caused the block and can update their tooling to remove the dependency.
Additional Safety Nets
The cgroup isolation provides further benefits beyond network filtering:
- Resource limits: CPU and memory constraints can be applied to the deployment cgroup, preventing runaway scripts from starving production workloads on the same host.
- Audit trails: All domain contacts during a deployment are logged, offering visibility into dependency chains for security and compliance reviews.
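The resource limits are plain cgroup v2 settings on the same cgroup the scripts already run in. As a sketch (the cgroup name "deploy" and the specific limits are hypothetical, not GitHub's values):

```
# cgroup v2 interface files, assuming the unified hierarchy is
# mounted at /sys/fs/cgroup and a "deploy" cgroup holds the scripts.

# cpu.max: "<quota> <period>" in microseconds.
# Allow at most 50ms of CPU every 100ms period (half a core):
/sys/fs/cgroup/deploy/cpu.max      -> "50000 100000"

# memory.max: hard memory cap for the whole cgroup:
/sys/fs/cgroup/deploy/memory.max   -> "2G"
```

Because the eBPF programs are already attached to this cgroup, the resource limits come essentially for free: one isolation boundary serves both network policy and resource policy.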
Impact and Adoption
After a six-month phased rollout, the system is now live across GitHub’s fleet. The outcome is measurable:
- Faster recovery: Mean time to recovery (MTTR) improves because deployment scripts no longer fail due to self-inflicted circular dependencies during incidents.
- Reduced toil: Teams spend less time debugging deployment failures caused by hidden dependencies.
- Proactive prevention: The system flags problematic dependencies before they cause incidents—e.g., if a tool update introduces a new github.com check.
GitHub emphasizes this isn’t a silver bullet. New circular dependency patterns may emerge (e.g., via non-DNS mechanisms), requiring ongoing tool refinement. However, the foundation establishes a repeatable process: isolate deployment contexts, monitor critical syscalls, and enforce policy at the kernel level with minimal performance overhead.
Getting Started with eBPF for Deployment Safety
Teams interested in similar approaches can begin with:
- The cilium/ebpf Go library, which simplifies eBPF program loading and map interaction (used in GitHub’s proof of concept).
- docs.ebpf.io for comprehensive eBPF concepts and program types.
- Open source tools like bpftrace for dynamic tracing or ptcpdump for process- and container-aware packet capture.
As Gripper and Levenstein note, the goal isn’t to eliminate all external dependencies but to ensure deployment scripts don’t reintroduce the very platforms they aim to fix. By making circular dependencies visible and blockable at the source, GitHub has turned a latent risk into a detectable, controllable variable—making their infrastructure more resilient exactly when it’s needed most.
