Oxide engineers traced intermittent loss of network connectivity on the Cosmo Service Processor to a mismatched memory attribute on the STM32H7 FMC bus, fixing the bug by moving the FPGA interface to a properly‑typed address region.

A disappearing Service Processor

11 DEC 2025 – Laura Abbott, Engineer, Oxide Computer Company

When we first tried to slot the next‑generation Cosmo sled into an Oxide rack, the Service Processor (SP) would sometimes vanish from the management network. The sled itself was still alive – the AMD host CPU kept running, fans spun at a high constant speed, and the chassis lights behaved oddly – but the SP stopped answering any network traffic. Without remote access we were essentially blind.

The problem we needed to solve

Network disappearance – the SP stopped broadcasting its presence on the management VLAN.
No traffic counters – the NIC’s packet counters stayed flat.
Fans at full speed – the SP normally adjusts fan speed continuously, so the constant high RPM hinted that the fan controller had fallen back to an emergency mode.
Reproducibility only in‑rack – the same sled behaved normally on a bench.

These clues pointed to a software or hardware fault that only manifested under the exact electrical and timing conditions of a rack‑mounted system.

First hypotheses: task starvation and stack overflow

The SP runs Hubris, Oxide’s custom Rust‑based OS. Each subsystem – networking, thermal control, OTA updates – lives in its own task. Hubris isn’t a hard real‑time OS, but it does prioritize tasks. We wondered whether a runaway task was starving the networking task, perhaps by looping forever after a crash and repeatedly restarting.

We lengthened the task‑restart back‑off to see if a stuck restart loop was the cause.
We switched the chassis LED from steady on to blinking to get a visual heartbeat even when the network was dead. The LED behaved inconsistently – sometimes stuck on, sometimes off – which suggested the problem wasn’t confined to a single low‑priority task.

Safe Rust prevents buffer‑overflow bugs, but Hubris still requires manual stack sizing. An oversized stack can hide a stack‑overflow condition; a too‑small stack triggers a safe restart. The kernel’s stacks are generous (512 B guard), so we ruled out a kernel stack overflow.

Pulling the SWD debug header

Because the SP is a production unit, we rarely attach a debug probe. Still, we rigged a Serial‑Wire‑Debug (SWD) cable to the hidden header on the sled, with help from colleagues in the Oxide office. The probe didn’t let us halt the CPU – the Cortex‑M7 simply refused a debug halt when the fault was active – but it gave us a way to observe the system.

The FPGA‑FMC bus clue

Cosmo’s biggest hardware change from the first‑generation Gimlet is an FPGA that mediates host flash and other peripherals. The FPGA is attached to the STM32H7 via the Flexible Memory Controller (FMC), a parallel bus that the CPU treats like external RAM.

The FMC manual (RM0433, §22.1) says its job is to translate AXI‑style transactions to the external device protocol and meet timing requirements. If the CPU never receives an acknowledgement from the FPGA, it can hang indefinitely on a bus access.

To test this, we programmed a minimal FPGA image that deliberately stalled on a read of a specific register. The SP exhibited the exact same symptoms: network silence, fans at full speed, and no LED heartbeat. This strongly implicated the FMC path.

Vector‑catch reset gave us a glimpse

ARM CPUs support a vector‑catch mode that halts the core immediately after reset, before any instruction executes. Because SRAM contents survive a warm reset, enabling vector catch let us reset the SP, capture a dump of RAM, and see which Hubris task was active at the moment of the hang.
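As a rough sketch of what a debug probe does to arm reset vector catch on an ARMv7‑M core, the function below builds the three register writes involved. The register addresses and bit positions come from the ARMv7‑M debug architecture; the function name and the idea of returning the writes as a list are illustrative assumptions, not the tooling we actually used. In practice a probe tool issues these writes for you.

```rust
// Core debug and system control registers (ARMv7-M architecture).
pub const DHCSR: u32 = 0xE000_EDF0; // Debug Halting Control and Status
pub const DEMCR: u32 = 0xE000_EDFC; // Debug Exception and Monitor Control
pub const AIRCR: u32 = 0xE000_ED0C; // Application Interrupt and Reset Control

const DHCSR_DBGKEY: u32 = 0xA05F << 16; // key required for DHCSR writes
const DHCSR_C_DEBUGEN: u32 = 1 << 0; // enable halting debug
const DEMCR_VC_CORERESET: u32 = 1 << 0; // halt on the reset vector
const AIRCR_VECTKEY: u32 = 0x05FA << 16; // key required for AIRCR writes
const AIRCR_SYSRESETREQ: u32 = 1 << 2; // request a system reset

/// Hypothetical helper: the (address, value) writes a probe would issue,
/// in order, to reset the core and halt it before the first instruction.
/// `current_demcr` is the DEMCR value read back beforehand, so that other
/// bits (for example trace enable) are preserved.
pub fn reset_and_catch_writes(current_demcr: u32) -> [(u32, u32); 3] {
    [
        // 1. Enable halting debug (DHCSR = 0xA05F_0001).
        (DHCSR, DHCSR_DBGKEY | DHCSR_C_DEBUGEN),
        // 2. Arm vector catch on core reset.
        (DEMCR, current_demcr | DEMCR_VC_CORERESET),
        // 3. Request the system reset (AIRCR = 0x05FA_0004).
        (AIRCR, AIRCR_VECTKEY | AIRCR_SYSRESETREQ),
    ]
}
```

Once the core halts on the reset vector, the probe can read SRAM through the debug port without the firmware having run a single instruction.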

The dump showed a task that was not the networking task, and nothing obvious was touching the FMC. However, the dump was noisy because the CPU cache retained stale data. Disabling the cache gave a clean view, but we still couldn’t reproduce the hang.

A security‑boot experiment amplified the bug

During a separate effort we added a Root‑of‑Trust (RoT) measurement step. The SP now hashes its flash on boot, and to guarantee integrity it may reset itself several times before the OS fully starts. This change made the disappearance reproducible in 10–20 minutes instead of the previous 24‑hour window.

With a reliable reproduction cadence we tried a battery of mitigations:

Varying reset intervals and counts.
Clearing the FPGA bitstream between resets.
Blocking all tasks from accessing the FMC.
Stripping out unrelated tasks.

None of these eliminated the issue.

The hidden culprit: mismatched memory attributes

The STM32H7’s Memory Protection Unit (MPU) isolates unprivileged tasks, but the privileged kernel still uses the default memory map. In Hubris we map the FMC region as Uncached Device Memory for tasks, which is correct for a peripheral bus.
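To make "Device memory" concrete, here is a minimal sketch of how an ARMv7‑M MPU region attribute word (RASR) encodes the memory type. The bit layout follows the ARMv7‑M MPU definition; the enum, function name, and parameters are assumptions for illustration and are not Hubris's actual MPU code.

```rust
/// Memory types expressible through the TEX/C/B fields of MPU_RASR.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum MemType {
    StronglyOrdered,
    DeviceShareable,
    DeviceNonShareable,
    NormalWriteBackWriteAllocate,
}

/// (TEX, C, B) encodings from the ARMv7-M MPU attribute table.
fn tex_c_b(mem: MemType) -> (u32, u32, u32) {
    match mem {
        MemType::StronglyOrdered => (0b000, 0, 0),
        MemType::DeviceShareable => (0b000, 0, 1),
        MemType::DeviceNonShareable => (0b010, 0, 0),
        MemType::NormalWriteBackWriteAllocate => (0b001, 1, 1),
    }
}

/// Hypothetical helper: build an enabled RASR value for a region of
/// 2^size_log2 bytes. `ap` is the 3-bit access-permission field
/// (0b011 = full access). Returns None for out-of-range arguments.
pub fn rasr(mem: MemType, size_log2: u32, ap: u32, xn: bool) -> Option<u32> {
    // ARMv7-M regions span 32 bytes (2^5) up to 4 GiB (2^32).
    if !(5..=32).contains(&size_log2) || ap > 0b111 {
        return None;
    }
    let (tex, c, b) = tex_c_b(mem);
    let size_field = size_log2 - 1; // region size = 2^(SIZE + 1)
    Some(
        ((xn as u32) << 28) // execute never
            | (ap << 24)    // access permissions
            | (tex << 19)
            | (c << 17)
            | (b << 16)
            | (size_field << 1)
            | 1,            // region enable
    )
}
```

For a 256 MiB, full‑access, execute‑never window, `rasr(MemType::DeviceShareable, 28, 0b011, true)` yields `0x1301_0037`, while the Normal write‑back variant of the same window is `0x130B_0037`. The two differ only in the TEX/C/B bits, which is exactly the part that has to agree between every view of the same address.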

Unfortunately, the base address we chose for the FMC (0x6000_0000 range) is marked Normal Cached in the default map. This means that when the kernel (running in privileged mode) accesses that address, the CPU treats it as cacheable memory. The ARMv7‑M reference (section A3.5.7) warns that mismatched attributes can cause the processor to lose the “preservation of the size of accesses” guarantee, leading to 16‑ or 8‑bit writes that the FPGA cannot handle.

Our likely scenario:

An unprivileged task writes to the FMC (store buffer).
An interrupt fires, switching to privileged mode.
The privileged code accesses the same address using the default (cached) attributes.
The cache attempts a 16‑bit write, violating the FPGA’s 32‑bit‑only contract.
The FMC bus hangs, the CPU stalls, and the SP appears dead.
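The 32‑bit‑only contract in the steps above is usually expressed in software as volatile, word‑sized accesses. The sketch below shows that pattern with a hypothetical `RegWindow` type; on a host it can be pointed at an ordinary array, which is how the example is meant to be tried. Note what it does not guarantee: volatile only constrains the compiler. Whether the bus actually sees a single 32‑bit transaction still depends on the memory attributes in force for that address, which is the crux of this bug.

```rust
use core::ptr::{read_volatile, write_volatile};

/// Hypothetical wrapper around a window of 32-bit registers.
pub struct RegWindow {
    base: *mut u32,
    len_words: usize,
}

impl RegWindow {
    /// # Safety
    /// `base` must point to `len_words` valid, 4-byte-aligned `u32` slots
    /// that stay alive for as long as the window is used.
    pub unsafe fn new(base: *mut u32, len_words: usize) -> Self {
        RegWindow { base, len_words }
    }

    /// Write one register as a single volatile 32-bit store.
    pub fn write(&self, index: usize, value: u32) {
        assert!(index < self.len_words, "register index out of range");
        // SAFETY: index is bounds-checked; `new` guarantees validity.
        unsafe { write_volatile(self.base.add(index), value) }
    }

    /// Read one register as a single volatile 32-bit load.
    pub fn read(&self, index: usize) -> u32 {
        assert!(index < self.len_words, "register index out of range");
        // SAFETY: index is bounds-checked; `new` guarantees validity.
        unsafe { read_volatile(self.base.add(index)) }
    }
}
```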

ARM explicitly recommends against aliasing the same physical location with different memory attributes. The fix was straightforward once we understood it: move the FMC to a base address that the default map already classifies as Device (non‑cached). The STM32H7 allows the FMC to be remapped, so we changed the base to 0xA000_0000, a region defined as Device memory.
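A check along these lines could catch this class of mismatch before it reaches hardware. The sketch encodes the ARMv7‑M default memory map, which assigns a memory type to each 512 MiB slice of the address space, and compares it against the type a task mapping requests. The type and function names are hypothetical; this is not a check that exists in Hubris as far as this post describes.

```rust
/// Memory type assigned by the ARMv7-M default memory map.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum DefaultType {
    Normal,
    Device,
    StronglyOrdered,
}

/// Classify an address according to the architectural default map.
pub fn default_map_type(addr: u32) -> DefaultType {
    match addr >> 29 {
        0 | 1 => DefaultType::Normal, // 0x0000_0000: Code, 0x2000_0000: SRAM
        2 => DefaultType::Device,     // 0x4000_0000: on-chip peripherals
        3 | 4 => DefaultType::Normal, // 0x6000_0000, 0x8000_0000: external RAM
        5 | 6 => DefaultType::Device, // 0xA000_0000, 0xC000_0000: external device
        _ => {
            // 0xE000_0000: the first 1 MiB is the Private Peripheral Bus
            // (Strongly-ordered); the vendor system space above it is Device.
            if addr < 0xE010_0000 {
                DefaultType::StronglyOrdered
            } else {
                DefaultType::Device
            }
        }
    }
}

/// Hypothetical consistency check: does the type a task's MPU region asks
/// for match what privileged code gets from the default map?
pub fn attributes_agree(base: u32, task_type: DefaultType) -> bool {
    default_map_type(base) == task_type
}
```

With this table, `attributes_agree(0x6000_0000, DefaultType::Device)` is false, which is the original configuration, and `attributes_agree(0xA000_0000, DefaultType::Device)` is true, which is the remapped one.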

Resolution and takeaways

After merging the address‑remap change, the SP has remained stable in every rack deployment we’ve tested. The issue has not resurfaced.

Key lessons for anyone building similar tightly‑coupled CPU‑FPGA systems:

Check default memory attributes for any peripheral address you expose to both privileged and unprivileged code.
Avoid aliasing – use a single, consistent address mapping.
Leverage SWD even on production units when you can; a simple probe can turn a blind‑spot investigation into a data‑driven one.
Document vendor quirks – the ARM and ST manuals eventually explained the problem, but the relevant paragraph is buried deep. Highlighting such edge cases helps the whole ecosystem.

Looking forward

The Cosmo sled is now shipping with the corrected mapping, and the SP’s reliability under heavy reset storms has been validated. Oxide remains committed to transparency; we will continue to publish post‑mortems like this one so that other teams can avoid the same pitfalls.

Transparency, rigorous debugging, and a willingness to question assumptions keep our racks humming – even when the Service Processor tries to disappear.

#Embedded Systems #Rust #FPGA #ARM #Debugging
