A senior DBA’s weekend in Macau turned into a 4 am crisis when a VAX/VMS billing system failed after an OS upgrade. The story highlights the perils of complacent on-call culture, the hidden costs of legacy stacks, and why modern observability can’t replace good process.
A weekend of racing, wine, and a surprise pager
Jemaine, a veteran DBA who spent the early 1990s on VAX/VMS platforms, thought he had handed off a routine OS upgrade to the local team. The client, a telecom operator in Macau, had arranged for two on‑site DBAs and gave him a hotel room with a perfect view of the Grand Prix circuit. After the race, a few bottles of Portuguese red and a lavish dinner seemed to seal the deal – until his pager rang at dessert time.
He raced to the client’s office only to find the billing application refusing to start. The on‑site staff had already re‑installed the operating system twice and were now blaming the database. While the DBAs rebuilt the database (a process that stretched into the early hours), Jemaine “sobered up” in a back room and prepared for the next round of troubleshooting.
The technical rabbit hole
When the database was finally back online, the application still wouldn’t launch. A quick health check showed the database was fine, but the batch scheduler was dead. The lead developer, who was off‑site, suggested stepping through a massive COBOL program in DEBUG mode over the phone – a classic example of “remote debugging on legacy code.”
The breakthrough came at 4 am, when Jemaine asked a simple question: “What account are you testing under?” The answer – an Administrator account – revealed the real culprit: the OS upgrade had changed permissions, so the account the batch job normally ran under no longer had the rights it needed. Running the batch job with elevated privileges made the system work again.
Community sentiment: the on‑call myth
Stories like Jemaine’s resonate because they expose a common belief that “once the upgrade is handed off, the job is done.” In many legacy environments, on‑call rotations are still treated as a courtesy rather than a responsibility. The sentiment on forums such as the r/sysadmin subreddit and the Server Fault community is that this attitude leads to “fire‑fighting” rather than proactive reliability work.
Adoption signals
- Increased investment in observability – Tools like Prometheus, Grafana, and OpenTelemetry are seeing higher adoption in enterprises still running legacy stacks, precisely to surface permission‑related failures before they hit production.
- Shift-left on-call training – Companies are rolling out onboarding modules that simulate “night-time incidents” for new hires, a practice also emphasized in Google’s SRE books.
- Policy‑as‑code for privileges – Projects such as OPA (Open Policy Agent) are gaining traction for codifying permission changes, reducing the chance that an OS upgrade silently breaks batch queues.
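To make the last point concrete, here is a minimal sketch of a pre-upgrade gate that consults an OPA server before privileged changes are applied. The policy path (upgrades/privileges/allow) and the input fields are hypothetical examples; only OPA’s standard Data API (POST /v1/data/<path>) is assumed.

```python
# Minimal sketch: ask an OPA server whether a proposed privilege change is allowed
# before an OS upgrade proceeds. The policy path and input schema are hypothetical;
# only OPA's standard Data API (POST /v1/data/<path>) is assumed.
import requests

OPA_URL = "http://localhost:8181/v1/data/upgrades/privileges/allow"

def change_is_allowed(account: str, privilege: str) -> bool:
    """Return True only if the policy explicitly allows the change."""
    payload = {"input": {"account": account, "privilege": privilege}}
    resp = requests.post(OPA_URL, json=payload, timeout=5)
    resp.raise_for_status()
    # OPA replies with {"result": <value>}; a missing result means the rule is undefined.
    return resp.json().get("result") is True

if __name__ == "__main__":
    # Example: flag a batch-queue account that would lose submit rights after the upgrade.
    if not change_is_allowed("BATCH_OPER", "SUBMIT"):
        print("Policy check failed: review privilege changes before upgrading.")
```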
Counter‑perspectives: not all legacy is doomed
Some practitioners argue that the drama of Jemaine’s night is overblown. They point out that:
- Human error is inevitable – Even with perfect monitoring, a mis‑typed permission or a forgotten admin account can slip through. The focus should be on rapid rollback mechanisms, not on eliminating every possible mistake.
- Cost of full modernization – Re‑architecting a VAX/VMS‑based billing system to a cloud‑native stack can run into the millions. For many telcos, the incremental fixes described above are the only financially viable path.
- Cultural factors – In regions where an on-call trip is treated as a perk rather than a burden, the “relax-and-wine” mindset may actually improve morale, provided the team has clear escalation paths.
What can teams learn?
- Never assume a hand‑off is final – Even if a local crew is present, maintain a minimal monitoring window for at least 24 hours after a major change.
- Document permission changes – Treat OS upgrades as a chance to audit all privileged operations. A simple checklist can catch the “new batch‑queue permission” issue before it surfaces.
- Automate the boring parts – Scripts that verify service start-up, batch-queue health, and admin-level access can run automatically after an upgrade, sending a clear “all-good” signal to on-call staff; a minimal post-upgrade check is sketched after this list.
- Invest in remote debugging tools – Modern IDEs with remote attach capabilities (e.g., VS Code Remote Development) reduce the need for painful phone‑based step‑throughs of COBOL code.
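As promised above, here is a minimal sketch of a post-upgrade smoke test. The service names, database host, and port are placeholders for whatever the billing stack actually exposes; the only real requirement is that the script exits non-zero whenever something is not “all good”.

```python
# Minimal post-upgrade smoke test: confirm key services are running and the database
# listener is reachable. Service names, host, and port below are placeholders.
import socket
import subprocess
import sys

SERVICE_CHECKS = {
    "batch scheduler": ["systemctl", "is-active", "--quiet", "batch-scheduler"],
    "billing application": ["systemctl", "is-active", "--quiet", "billing-app"],
}

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def run_checks() -> bool:
    ok = True
    for name, cmd in SERVICE_CHECKS.items():
        passed = subprocess.run(cmd).returncode == 0
        print(f"{'PASS' if passed else 'FAIL'}: {name} is running")
        ok = ok and passed
    db_up = port_open("db.example.internal", 1521)  # placeholder database listener
    print(f"{'PASS' if db_up else 'FAIL'}: database listener reachable")
    return ok and db_up

if __name__ == "__main__":
    # Exit non-zero so the upgrade runbook (or the pager) sees anything but "all good".
    sys.exit(0 if run_checks() else 1)
```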

A cautionary tale for the next generation
Jemaine’s story is a reminder that on‑call work is rarely a clean hand‑off. It underscores the importance of process hygiene, observable systems, and clear privilege management. While the wine may have been forgotten, the lesson remains fresh: a pager can ring at any hour, and the real job often starts when the celebration ends.
If you have an on‑call anecdote that turned a weekend into a marathon debugging session, share it with the On‑Call column for a future feature.
