A critical bug in NVIDIA's open-source GPU kernel driver causes nvidia-smi and NVLink-dependent applications to hang after approximately 66 days and 12 hours of uptime on B200 GPUs, revealing a potential jiffies-based timing overflow that affects multiple driver versions.
The NVIDIA open-source kernel driver, a project that has been gaining traction in the Linux community, has encountered a significant and oddly precise failure mode. Users running the driver on the latest B200 GPUs report that nvidia-smi and any application relying on NVLink communication become unresponsive after roughly 66 days and 12 hours of system uptime.
The issue, documented in GitHub issue #971 in NVIDIA's open-gpu-kernel-modules repository, was first reported by user zheng199512 in late 2025. The report describes a system running the 570.133.20 driver on an NVIDIA B200 GPU with kernel 6.6.0. Once the uptime threshold is crossed, the nvidia-smi command hangs indefinitely. The system logs (dmesg) show a cascade of NVLink-related errors, specifically knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask! and knvlinkDiscoverPostRxDetLinks_GH100: Getting peer's postRxDetLinkMask failed!. These errors indicate a failure in the GPU's ability to manage its high-speed interconnects, which are crucial for multi-GPU communication in data centers and AI clusters.
This isn't an isolated incident. The community quickly chimed in with similar experiences. User t-shunsuke01 reported the same behavior on a B200 running Ubuntu 24.04 with driver versions 580.82.07 and 580.105.08, confirming the bug persists across multiple driver branches. Another user, lmacken, provided a concrete timeline: two B200s, both booted on October 31st, failed on January 6th, 67 days later, consistent with the reported threshold. The most alarming report came from jquesnelle, who hit the failure across a 256-GPU B200 cluster, causing widespread job failures.
NVIDIA engineer mtijanic acknowledged the issue, confirming it affects the 570 series and has been reproduced on the 580 series. The problem has been escalated to the internal NVLink team under bug IDs 5746052 and 5607938. While the official response is that a fix is in the works, no timeline has been provided.
The 66-Day Mystery: A Jiffies Overflow?
The oddly specific failure time of 66 days and 12 hours led the community to speculate about a root cause. User ma-ts investigated and proposed a compelling theory: a jiffies-based timing overflow. The Linux kernel counts timer ticks in a variable called jiffies, and the tick rate is set by the kernel configuration option CONFIG_HZ. On a kernel built with CONFIG_HZ=750, a 32-bit tick counter wraps back to zero after roughly 66 days (2^32 ticks / 750 Hz ≈ 5.73 million seconds). This is a classic unsigned-integer wraparound problem.
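As a sanity check on that arithmetic, the short userspace program below prints where a 32-bit tick counter wraps for a few HZ values; only HZ=750 lands near the reported ~66-day mark. This is illustrative arithmetic only, not driver code.

```c
/*
 * Illustrative arithmetic only (not driver code): where does a 32-bit
 * tick counter wrap for different CONFIG_HZ settings?
 */
#include <stdio.h>

int main(void)
{
    const unsigned hz_values[] = { 100, 250, 300, 750, 1000 };
    const double wrap_ticks = 4294967296.0;   /* 2^32 ticks until wrap */

    for (unsigned i = 0; i < sizeof hz_values / sizeof hz_values[0]; i++) {
        double seconds = wrap_ticks / hz_values[i];
        printf("HZ=%4u -> counter wraps after %6.1f days\n",
               hz_values[i], seconds / 86400.0);
    }
    return 0;   /* HZ=750 prints ~66.3 days, matching the reports */
}
```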
If the NVIDIA driver's NVLink state machine relies on jiffies for timing calculations or timeouts, hitting this wrap-around point could cause the logic to break, leading to the observed failures. This theory is supported by the fact that the errors are related to NVLink link detection and mask updates, processes that likely involve timed operations.
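To see why a wrap can break timeout logic, here is a minimal sketch of the failure pattern the theory implies. All names are hypothetical and nothing is taken from the NVIDIA driver source: a naive comparison on a 32-bit tick counter misfires when the counter wraps, while a wrap-safe comparison in the style of the kernel's time_after() macro does not.

```c
/*
 * Sketch of how a timeout check on a 32-bit tick counter can misbehave at
 * wraparound. All names are hypothetical; nothing here is taken from the
 * NVIDIA driver source.
 */
#include <stdio.h>
#include <stdint.h>

/* Naive check: once the deadline has wrapped past zero while 'now' is still
 * near UINT32_MAX, the deadline looks like it is already in the past, so the
 * timeout fires immediately (the converse case makes it never fire). */
static int expired_naive(uint32_t now, uint32_t deadline)
{
    return now >= deadline;
}

/* Wrap-safe check in the style of the kernel's time_after() macro: the
 * signed difference tolerates a single wraparound. */
static int expired_safe(uint32_t now, uint32_t deadline)
{
    return (int32_t)(now - deadline) >= 0;
}

int main(void)
{
    uint32_t now = UINT32_MAX - 5;   /* just before the counter wraps    */
    uint32_t deadline = now + 100;   /* 100 ticks from now; wraps to 94  */

    printf("naive: %s\n", expired_naive(now, deadline) ? "expired" : "pending");
    printf("safe : %s\n", expired_safe(now, deadline) ? "expired" : "pending");
    return 0;   /* prints "expired" then "pending": the naive check misfires */
}
```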
A pull request, PR #1014, was mentioned in the discussion. It proposes moving from jiffies to ktime_get_raw_ts64(), a kernel timing interface that reports time as a 64-bit timespec64 (seconds plus nanoseconds) and therefore does not wrap on any realistic uptime. This aligns well with the proposed diagnosis.
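For illustration, the sketch below shows the general shape of such a change in userspace terms, using clock_gettime(CLOCK_MONOTONIC_RAW) as a stand-in for the in-kernel ktime_get_raw_ts64(). The 500 ms timeout and the loop structure are assumptions for the example, not code from the PR.

```c
/*
 * Userspace analogue of the approach attributed to PR #1014: timestamps
 * from a 64-bit monotonic raw clock instead of a 32-bit tick counter.
 * CLOCK_MONOTONIC_RAW stands in for the in-kernel ktime_get_raw_ts64();
 * the 500 ms timeout and loop shape are assumptions for illustration.
 */
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdint.h>
#include <time.h>

/* Nanoseconds since an arbitrary start point; a signed 64-bit nanosecond
 * count does not wrap for roughly 292 years, far beyond any uptime. */
static int64_t raw_ns_now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
    const int64_t timeout_ns = 500 * 1000000LL;   /* hypothetical 500 ms */
    int64_t start = raw_ns_now();

    /* The subtraction happens in 64 bits, so this loop never misfires at a
     * 32-bit tick wrap, no matter how long the machine has been up. */
    while (raw_ns_now() - start < timeout_ns) {
        struct timespec pause = { 0, 1000000 };   /* sleep 1 ms between polls */
        nanosleep(&pause, NULL);
    }

    printf("timed out after %.1f ms\n", (raw_ns_now() - start) / 1e6);
    return 0;
}
```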
Community Impact and Counter-Perspectives
The impact is substantial for anyone running large-scale B200 deployments. The failure doesn't just affect monitoring tools like nvidia-smi; it cripples any application that uses NVLink for GPU-to-GPU communication. In AI and HPC workloads, where multi-GPU coordination is standard, this bug can bring entire clusters to a halt, requiring a full system reboot to recover.
However, it's important to note the scope. The issue appears specific to the open-source kernel driver (NVIDIA/open-gpu-kernel-modules). The reporter confirmed that the same problem does not occur with the proprietary driver of the same version. This suggests the bug may have been introduced during the porting process or is a unique flaw in the open-source implementation's timing logic.
Furthermore, the bug seems tied to specific hardware (B200) and the NVLink infrastructure. Systems using single GPUs or not relying on NVLink might not experience the hang, though the underlying driver instability could still pose a risk. NVIDIA's internal tracking of the issue under multiple bug IDs indicates it's a recognized and serious problem within their development teams.
What to Watch For
For now, administrators of B200 clusters using the open-source driver have a few options:
- Schedule Reboots: Implement a maintenance schedule to reboot systems before the 66-day mark (see the uptime-check sketch after this list). This is a temporary workaround but ensures stability.
- Monitor Logs: Keep an eye on dmesg for the telltale NVLink errors. Early warnings might provide a window to act before a full hang.
- Consider the Proprietary Driver: If stability is paramount and the open-source driver's features aren't critical, reverting to the proprietary driver may be a prudent choice until a fix is released.
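As referenced above, here is a minimal uptime-check sketch that could back a maintenance-reboot schedule. The 66.5-day threshold and the seven-day margin are assumptions chosen for illustration, not values taken from the GitHub issue.

```c
/*
 * Minimal uptime-check sketch for a maintenance-reboot schedule. The
 * 66.5-day threshold and 7-day margin are assumptions for illustration,
 * not values taken from the GitHub issue.
 */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/uptime", "r");
    double uptime_s = 0.0;

    if (!f || fscanf(f, "%lf", &uptime_s) != 1) {
        fprintf(stderr, "failed to read /proc/uptime\n");
        return 1;
    }
    fclose(f);

    const double threshold_days = 66.5;   /* reported failure point         */
    const double margin_days    = 7.0;    /* reboot at least this far ahead */
    double uptime_days = uptime_s / 86400.0;

    if (uptime_days > threshold_days - margin_days)
        printf("WARNING: uptime is %.1f days; schedule a reboot soon\n",
               uptime_days);
    else
        printf("OK: uptime is %.1f days\n", uptime_days);
    return 0;
}
```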
The community is actively discussing potential patches and workarounds. The proposed fix in PR #1014 is a strong candidate, but it requires validation and integration into a stable driver release. Until NVIDIA officially patches the issue, this 66-day time bomb remains a critical consideration for any deployment of NVIDIA's open-source driver on Blackwell architecture GPUs.
