macOS Has a 49.7-Day Networking Time Bomb Built In That Only a Reboot Fixes

A critical networking bug in macOS causes machines to stop accepting new TCP connections after 49.7 days of uptime, due to an integer overflow in the TCP timestamp counter that prevents proper connection cleanup.

A critical bug in macOS's networking stack has been discovered that causes machines to stop accepting new TCP connections after running for exactly 49 days, 17 hours, 2 minutes, and 47 seconds. The issue, uncovered by the team at Photon while monitoring iMessage services, stems from an integer overflow in Apple's XNU kernel that effectively freezes the TCP timestamp counter and prevents proper cleanup of closed connections.

The 49.7-Day Time Bomb

The problem centers on tcp_now, an internal counter that tracks the time since boot as far as the TCP stack is concerned, down to the millisecond. This counter is a 32-bit unsigned integer, which has a maximum value of 4,294,967,295 (2^32 - 1) before it wraps around to zero. Since tcp_now counts milliseconds, this maximum corresponds to about 4,294,967 seconds, or roughly 49.7 days.
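The arithmetic is easy to check. The short Python sketch below assumes, as described above, that the counter starts at zero at boot and ticks once per millisecond; the function names are made up for illustration.

```python
from datetime import timedelta

UINT32_MAX = 2**32 - 1  # 4,294,967,295, the largest 32-bit unsigned value


def uptime_at_wrap(tick_ms: int = 1) -> timedelta:
    """Uptime at which a 32-bit tick counter that started at 0 wraps to 0."""
    return timedelta(milliseconds=(UINT32_MAX + 1) * tick_ms)


def breakdown(td: timedelta) -> tuple[int, int, int, int]:
    """Split a timedelta into whole (days, hours, minutes, seconds)."""
    hours, rem = divmod(td.seconds, 3600)
    minutes, seconds = divmod(rem, 60)
    return td.days, hours, minutes, seconds


if __name__ == "__main__":
    wrap = uptime_at_wrap()
    print(breakdown(wrap))                           # (49, 17, 2, 47)
    print(round(wrap.total_seconds() / 86400, 2))    # 49.71
```

This matches the 49 days, 17 hours, 2 minutes, and 47 seconds quoted in the report.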

When tcp_now reaches this limit, a bug in Apple's implementation causes it to get stuck at its maximum value rather than wrapping around correctly. This creates a cascade of problems for the TCP connection management system.

How the Bug Breaks Networking

According to TCP standards, operating systems are supposed to keep a closed TCP connection around only for a short period (the TIME_WAIT state) before removing it; on macOS that period is 30 seconds. However, when tcp_now is frozen at its maximum value, every connection's expiration status is calculated against this frozen number.

The comparison math breaks down for values near the 32-bit unsigned integer limit. The periodic check that decides whether a closed connection should be deleted then always returns "no" because the calculation overflows, so closed connections are never cleaned up.
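To see how a frozen clock can make such a check fail forever, here is a simplified Python model. It is not XNU's code: the helper names are hypothetical, and it assumes the check uses the usual wraparound-safe comparison (the signed 32-bit difference of two timestamps), which is one plausible reading of the description above.

```python
MASK32 = 0xFFFFFFFF
TIME_WAIT_MS = 30_000  # 30 seconds, the macOS value cited in the article


def to_int32(value: int) -> int:
    """Reinterpret the low 32 bits of value as a signed 32-bit integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def tstamp_geq(a: int, b: int) -> bool:
    """True if 32-bit timestamp a is at or after b, allowing for wraparound."""
    return to_int32(a - b) >= 0


def deadline_for(closed_at: int) -> int:
    """Expiry time of a connection closed at closed_at (wraps at 2^32)."""
    return (closed_at + TIME_WAIT_MS) & MASK32


def is_expired(now: int, deadline: int) -> bool:
    return tstamp_geq(now, deadline)


if __name__ == "__main__":
    closed_at = MASK32 - 10_000          # closed 10 s before the counter limit
    deadline = deadline_for(closed_at)   # wraps around to a small number

    healthy_now = (closed_at + 31_000) & MASK32  # clock wrapped and kept going
    frozen_now = MASK32                          # clock stuck at its maximum

    print(is_expired(healthy_now, deadline))  # True
    print(is_expired(frozen_now, deadline))   # False, and it stays False
```

In this model a clock that wraps normally reaches the deadline on schedule, while a clock stuck at the maximum always appears to be 20 seconds short of it, so the connection is never reaped.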

The Consequences

With closed connections never being removed, the TCP stack fills up with ephemeral ports that are held in error. Because these ports stay tied up indefinitely, the system eventually runs out of ports for new connections. How quickly this happens depends on network activity, but in any server or professional environment with regular network traffic, it happens fast.
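A back-of-the-envelope estimate shows how fast. The Python sketch below assumes the usual macOS default ephemeral range of 49152 to 65535 (the net.inet.ip.portrange.first and net.inet.ip.portrange.last sysctls) and a deliberately crude model in which every leaked connection pins one distinct port. Real behavior depends on which remote endpoints are involved, so treat the numbers as rough orders of magnitude.

```python
# Assumed defaults; check net.inet.ip.portrange.first/last on the machine.
EPHEMERAL_FIRST = 49152
EPHEMERAL_LAST = 65535


def ephemeral_port_count(first: int = EPHEMERAL_FIRST,
                         last: int = EPHEMERAL_LAST) -> int:
    """Number of ports in the inclusive range [first, last]."""
    return last - first + 1


def seconds_until_exhaustion(new_conns_per_second: float,
                             ports: int = ephemeral_port_count()) -> float:
    """Time until every port is pinned, if no closed connection is reaped."""
    if new_conns_per_second <= 0:
        raise ValueError("connection rate must be positive")
    return ports / new_conns_per_second


if __name__ == "__main__":
    for rate in (0.1, 1.0, 5.0):  # hypothetical connection rates
        hours = seconds_until_exhaustion(rate) / 3600
        print(f"{rate} conn/s -> about {hours:.1f} hours")
```

Under these assumptions, even a modest five new connections per second would use up the range in under an hour.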

Machines affected by this bug continue to respond to ping requests, making the issue particularly difficult to diagnose. Existing network connections remain functional, but the inability to establish new connections effectively renders the machine unusable for most network-dependent tasks.
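Because ping keeps working, a health check for this failure has to attempt a real TCP connection. A minimal sketch in Python, using only the standard library, might look like the following; the function name is invented for illustration.

```python
import socket


def can_open_tcp(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a new TCP connection to host:port can be established."""
    try:
        # create_connection performs the full TCP handshake, which a ping
        # (ICMP echo) never exercises.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and local failures such as
        # having no free port to connect from.
        return False
```

Run from the suspect machine against a service known to be up, or from another host against the suspect machine, a False result alongside successful pings fits the symptom described here, although it does not prove this bug is the cause.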

Historical Context

This class of problems is well known in computing history. Integer overflows have been responsible for various system failures, most notably Windows 98's famous 49.7-day crash. The computing industry is also bracing for the Year 2038 problem, when systems that store time as a signed 32-bit count of seconds will overflow.

The long-standing RFC 7323 describes how the TCP timestamp clock should behave when it reaches its limit, but Apple's implementation of tcp_now fails to handle the overflow properly.

Discovery and Testing

The Photon team first encountered this issue when some machines in their fleet suddenly stopped responding to network connections despite answering ping requests. After rebooting the affected machines, they noticed another set approaching the 49.7-day uptime threshold and set up scripts to test their theory.

Their testing confirmed that when the fateful moment arrived, machines stopped creating new connections without any error messages. The team's systematic approach to identifying and reproducing the bug demonstrates the importance of thorough testing in production environments.
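Photon's scripts are not reproduced here. As a rough illustration of what such a check could look like, the Python sketch below counts TCP sockets per state in netstat output, on the assumption that the state is the last column, as in the sample. A TIME_WAIT count that only ever grows would be consistent with the behavior described.

```python
from collections import Counter


def tcp_state_counts(netstat_output: str) -> Counter:
    """Count TCP sockets per state in `netstat -an -p tcp` style output."""
    counts: Counter = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows start with tcp4/tcp6/tcp46; header rows are skipped.
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            counts[fields[-1]] += 1
    return counts


# Illustrative sample using documentation-only addresses, not real data.
SAMPLE = """\
Active Internet connections (including servers)
Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)
tcp4       0      0  192.0.2.10.52344       198.51.100.7.443       TIME_WAIT
tcp4       0      0  192.0.2.10.52345       198.51.100.7.443       TIME_WAIT
tcp4       0      0  192.0.2.10.52001       198.51.100.9.443       ESTABLISHED
tcp4       0      0  *.22                   *.*                    LISTEN
"""

if __name__ == "__main__":
    print(tcp_state_counts(SAMPLE)["TIME_WAIT"])  # 2
```

On a live machine the input could come from subprocess.run(["netstat", "-an", "-p", "tcp"], capture_output=True, text=True).stdout, sampled at intervals.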

Current Mitigation and Future Fixes

According to Photon, the current mitigation is a simple but disruptive one: reboot the affected machines. Any systems administrator will recognize this as an unsatisfactory "solution" to a mystery issue, as it doesn't address the root cause and guarantees the problem will recur.

The team reports they're working on an alternative solution while Apple presumably works on a proper fix. Given the severity and visibility of this issue, it's likely to be addressed quickly—hopefully before 49.7 days after the report's publication.

Broader Implications

This bug highlights several important points about modern computing systems. First, it demonstrates how seemingly obscure technical details—like the maximum value of a 32-bit unsigned integer—can have real-world impacts on system reliability. Second, it shows how difficult it can be to diagnose intermittent issues that don't produce clear error messages.

The fact that this bug affects macOS's ability to function as a server or 24/7 system reinforces observations about the operating system's design philosophy. While macOS has Unix roots, it's not optimized for unattended, continuous operation in the way that traditional server operating systems are.

For organizations running macOS in server-like roles or production environments, this discovery necessitates monitoring uptime closely and implementing regular reboot schedules before the 49.7-day threshold. For individual users who rarely leave their Macs running for extended periods, the impact may be minimal, but it's a reminder of the hidden complexities in modern operating systems.
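One way to automate that monitoring is to compare the machine's uptime against the wrap point. The Python sketch below parses the output of `sysctl -n kern.boottime` on macOS and flags machines inside a safety margin. The three-day margin and the function names are arbitrary choices for illustration, and the sketch assumes tcp_now starts near zero at boot and ticks once per millisecond, as the report describes.

```python
import re

WRAP_SECONDS = 2**32 / 1000  # 4,294,967.296 s, about 49.7 days


def parse_boottime(sysctl_output: str) -> int:
    """Extract the boot time (Unix seconds) from kern.boottime output,
    which looks like '{ sec = 1700000000, usec = 0 } <date>'."""
    match = re.search(r"sec\s*=\s*(\d+)", sysctl_output)
    if match is None:
        raise ValueError("unrecognized kern.boottime output")
    return int(match.group(1))


def seconds_until_wrap(boot_epoch: int, now_epoch: float) -> float:
    """Seconds left before uptime reaches the 32-bit millisecond limit."""
    return WRAP_SECONDS - (now_epoch - boot_epoch)


def should_reboot(boot_epoch: int, now_epoch: float,
                  margin_days: float = 3.0) -> bool:
    """True once the machine is within margin_days of the wrap point."""
    return seconds_until_wrap(boot_epoch, now_epoch) <= margin_days * 86400


if __name__ == "__main__":
    boot = parse_boottime("{ sec = 1700000000, usec = 0 } (sample)")
    print(should_reboot(boot, boot + 40 * 86400))  # False: ~9.7 days left
    print(should_reboot(boot, boot + 47 * 86400))  # True: ~2.7 days left
```

In practice the current time would come from time.time() and the sysctl output from a subprocess call, with the result feeding whatever alerting or maintenance scheduling the fleet already uses.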

The discovery also serves as a cautionary tale about the importance of proper integer handling and adherence to networking standards in operating system development. As systems continue to run longer between reboots, such edge cases become increasingly relevant to system stability and reliability.
