Waymo Robotaxi Gridlock: When Safety Protocols Meet Real-World Saturation

[Image: San Francisco's Waymo fleet faced an unexpected challenge during a citywide power outage.]

Last Saturday's power outage in San Francisco revealed an unforeseen vulnerability in autonomous vehicle operations: Waymo robotaxis became immobilized across the city due to system saturation. Designed to treat non-functioning traffic signals as four-way stops, the vehicles instead flooded remote operators with confirmation requests, overwhelming human capacity and creating traffic bottlenecks.

The Saturation Point

Waymo's incident report explains:

"While the Waymo Driver is designed to handle dark traffic signals as four-way stops, it may occasionally request a confirmation check to ensure it makes the safest choice. While we successfully traversed more than 7,000 dark signals on Saturday, the outage created a concentrated spike in these requests. This created a backlog that, in some cases, led to response delays contributing to congestion on already-overwhelmed streets."

These "confirmation checks" – likely human validations via remote teleoperators – became the system's critical bottleneck. Unlike cloud resources that scale dynamically, human operators couldn't instantly expand to meet demand during the citywide emergency.

Engineering Irony: Safety Creates Vulnerability

This is a classic saturation failure: a finite resource (human operator attention) was exhausted under abnormal load. As Lorin Hochstein notes, "saturation is an ever-present risk" in any system with finite resources. Ironically, the confirmation protocol was implemented explicitly for safety:

"We established these confirmation protocols out of an abundance of caution during our early deployment... While this strategy was effective during smaller outages, we are now refining them to match our current scale."

The incident underscores how safety measures can unintentionally create novel failure modes at scale. Waymo is now implementing "fleet-wide updates that provide the Driver with specific power outage context," reducing reliance on human verification.
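
Waymo has not published the Driver's internal logic, but the described fix maps onto a familiar pattern: use fleet-wide broadcast context to resolve ambiguity locally, and bound any remote check with a timeout and a safe default. The sketch below is hypothetical throughout; the names (handle_dark_signal, outage_zones, request_confirmation) and the timeout value are invented for illustration:

```python
import time

CONFIRMATION_TIMEOUT_S = 5.0  # invented bound on waiting for a human answer

def handle_dark_signal(intersection, outage_zones, request_confirmation):
    """Decide how to treat a non-functioning traffic signal.

    outage_zones: fleet-wide broadcast of areas with a known power outage
    request_confirmation: callable that asks a remote operator; returns the
        operator's decision, or None if no answer arrives by the deadline
    """
    if intersection.zone in outage_zones:
        # A known citywide outage makes a dark signal expected, so the
        # vehicle proceeds as a four-way stop without queueing on a human.
        return "FOUR_WAY_STOP"

    # Ambiguous case: ask a remote operator, but never block indefinitely.
    deadline = time.monotonic() + CONFIRMATION_TIMEOUT_S
    answer = request_confirmation(intersection, deadline)
    if answer is not None:
        return answer

    # Operator pool saturated or unreachable: degrade gracefully to the
    # conservative default instead of stalling in the intersection.
    return "FOUR_WAY_STOP"
```

The design property worth copying is that the human check improves decisions when it is available, but its absence never removes the vehicle's ability to act.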

Lessons for Engineers

  1. Scalability Isn't Just Technical: Human-in-the-loop systems require different scaling strategies than pure software
  2. Edge Cases Multiply: Rare events (like citywide outages) become near-certainties at sufficient scale; see the arithmetic sketch after this list
  3. Graceful Degradation: Systems need fallback modes when oversight resources are unavailable
  4. Trade-off Awareness: Every safety layer introduces potential new points of failure
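
The second lesson is just arithmetic: if a disruptive event has per-trial probability p, the chance of seeing it at least once in n independent trials is 1 - (1 - p)^n. The probability and fleet sizes below are invented to show how quickly "rare" stops meaning "unlikely":

```python
# Rare events become near-certain at scale: P(>=1 occurrence) = 1 - (1 - p)^n
# The probability p and the city/day counts are illustrative assumptions.

def p_at_least_one(p: float, trials: int) -> float:
    """Probability that an event of per-trial probability p occurs at least once."""
    return 1.0 - (1.0 - p) ** trials

p = 0.001  # a "1-in-1000-days" event for any single operating city
for cities, days in [(1, 365), (10, 365), (25, 1095)]:
    print(f"{cities:>2} cities x {days:>4} days -> "
          f"{p_at_least_one(p, cities * days):.3f}")

# Output:
#  1 cities x  365 days -> 0.306
# 10 cities x  365 days -> 0.974
# 25 cities x 1095 days -> 1.000
```

A failure mode that never surfaces in a pilot deployment becomes a near-certainty once the fleet is large enough, which is why "handle the common case plus a human escape hatch" stops being a complete design at scale.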

As autonomous systems proliferate, this incident serves as a vital case study in designing for real-world chaos. The path forward requires balancing caution with autonomy, recognizing that sometimes the safest system is one that can function independently when the unexpected occurs.

Source: Surfing Complexity Blog