An automated suspension of Railway’s production Google Cloud account on May 19, 2026 triggered a cascade of failures across Railway’s multi‑cloud platform. The incident exposed a single point of failure in the network control plane and highlighted how hard it is to restore services after a provider‑level account lockout.

Post‑mortem: How an Automated GCP Suspension Propagated into an 8‑hour Railway Outage

Authors: Chandrika Khanduri and Cody De Arkland – May 20, 2026

What was claimed?

Railway announced that a Google Cloud Platform (GCP) account suspension caused a platform‑wide outage lasting roughly eight hours. The statement emphasized that the suspension was an incorrect automated action affecting many customers, and that Railway’s own architecture had “high‑availability” components that should have insulated users from a single provider failure.

What is actually new?

Account‑level suspension as a failure mode – Most post‑mortems focus on node or zone failures. Here the entire cloud‑provider account became inaccessible, which instantly cut off compute, networking, and persistent storage.
Cache‑driven cascade – Railway’s edge proxies cache routing tables from a control‑plane API hosted on GCP. When the cache expired, the edge could no longer resolve routes, causing workloads on AWS and Railway Metal to return 404s even though their compute instances remained up.
Cross‑cloud dependency on a single API – The network mesh (Metal ↔ GCP ↔ AWS) was intact, but the service‑discovery layer was a single point of failure because it lived exclusively in GCP.
Secondary throttling from GitHub – The outage generated a burst of retry traffic that hit GitHub’s rate limits, temporarily breaking OAuth and webhook flows.
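
The cache-driven cascade is easier to see in code. The following is a minimal sketch, not Railway's edge proxy: it assumes a hard 30-minute TTL and a control-plane fetch that raises `ConnectionError` when GCP is unreachable. `RouteCache`, `fetch_routes`, the hostname, and the backend name are all hypothetical.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class RouteCache:
    """Hypothetical edge cache with a hard TTL and no stale fallback."""

    def __init__(self, fetch_routes: Callable[[], Dict[str, str]], ttl_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch_routes
        self._ttl_s = ttl_s
        self._clock = clock
        self._routes: Dict[str, str] = {}
        self._fetched_at: Optional[float] = None

    def resolve(self, host: str) -> Tuple[int, Optional[str]]:
        now = self._clock()
        expired = self._fetched_at is None or now - self._fetched_at >= self._ttl_s
        if expired:
            try:
                self._routes = self._fetch()
                self._fetched_at = now
            except ConnectionError:
                # Control plane unreachable: the expired table is discarded,
                # so every host looks unknown even if its backend is healthy.
                self._routes = {}
        backend = self._routes.get(host)
        return (200, backend) if backend else (404, None)


def demo() -> list:
    now = [0.0]
    control_plane_up = [True]

    def fetch_routes() -> Dict[str, str]:
        if not control_plane_up[0]:
            raise ConnectionError("control plane unreachable")
        return {"app.example.com": "aws-backend-1"}

    cache = RouteCache(fetch_routes, ttl_s=1800, clock=lambda: now[0])
    results = [cache.resolve("app.example.com")]      # fresh fetch succeeds
    control_plane_up[0] = False                       # provider account suspended
    now[0] = 900
    results.append(cache.resolve("app.example.com"))  # within TTL, still served
    now[0] = 1800
    results.append(cache.resolve("app.example.com"))  # TTL expired, refresh fails
    return results
```

In this sketch the backend never changes state. The only thing that fails is the refresh, which is why workloads on healthy clouds can start returning 404s once the cache expires.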

What are the concrete limitations?

Area	Limitation	Why it mattered
Account suspension	No graceful degradation path when a provider disables the entire account.	All GCP‑hosted resources (VMs, disks, networking) became unavailable simultaneously.
Control‑plane placement	Discovery API ran only on GCP instances.	Edge caches expired after ~30 min, forcing the mesh to depend on a dead service.
Recovery granularity	Restoring the account did not automatically bring back disks, networking, or compute.	Engineers had to orchestrate multiple independent restores, extending downtime.
Rate‑limit handling	No back‑off strategy for bulk retries to third‑party services.	GitHub throttled OAuth/webhooks, compounding user‑login failures.

Timeline (UTC)

Time	Event
22:10	Monitoring alerts fire on API health‑check failures.
22:11	Dashboard returns 503; users cannot log in.
22:19	On‑call identifies GCP account suspension as root cause.
22:22	P0 ticket opened with Google; account manager engaged.
22:29	Incident declared; GCP access restored but compute stays stopped.
22:35	Edge cache expires; AWS/Metal workloads start returning 404.
23:09	First persistent disk becomes reachable.
23:54	All disks in ready state; network still down.
01:30	Compute instances begin to recover.
01:38	Edge traffic resumes as routing tables repopulate.
02:47	GitHub begins rate‑limiting OAuth/webhook calls.
04:00	API, dashboard, and OAuth endpoints confirmed operational.
06:14	Incident moved to monitoring.
07:58	Formal resolution declared.
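
The intervals implied by the timeline can be checked with a few lines of arithmetic. This sketch uses only the timestamps in the table and assumes the incident crosses midnight UTC exactly once; the helper name `elapsed` and the milestone labels are illustrative.

```python
from datetime import datetime, timedelta


def elapsed(start: str, end: str) -> timedelta:
    """Elapsed time between two HH:MM UTC stamps, allowing one midnight rollover."""
    fmt = "%H:%M"
    t0 = datetime.strptime(start, fmt)
    t1 = datetime.strptime(end, fmt)
    if t1 < t0:
        t1 += timedelta(days=1)  # the end stamp falls on the following day
    return t1 - t0


first_alert = "22:10"
milestones = {
    "edge cache expires": "22:35",
    "all disks ready": "23:54",
    "edge traffic resumes": "01:38",
    "core endpoints confirmed": "04:00",
    "moved to monitoring": "06:14",
    "formal resolution": "07:58",
}

for name, stamp in milestones.items():
    print(f"{name:26s} +{elapsed(first_alert, stamp)}")
```

On these timestamps, the incident moved to monitoring 8 h 04 min after the first alert and was formally resolved after 9 h 48 min. AWS and Metal workloads were returning 404s for about 3 h 03 min (22:35 to 01:38).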

How the recovery unfolded

Account re‑enable – Google lifted the suspension, but the underlying resources remained in a stopped state. Persistent disks required a separate “ready” transition, which did not complete for all disks until 23:54 UTC.
Networking restoration – GCP’s VPC and firewall rules had to be rebuilt. Until the VPC was functional, edge proxies could not fetch fresh routes, so the mesh remained effectively blind.
Staggered compute bring‑up – To avoid a thundering‑herd effect on the orchestration layer, instances were started in batches. This prevented further cascading failures but added latency.
Deploy pipeline pause – Build and deployment workers were throttled to keep the restored services from being overwhelmed by the backlog of queued jobs.
GitHub back‑off – Engineers introduced exponential back‑off on OAuth/webhook retries once the rate‑limit response was observed, allowing the GitHub API to recover.
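
The GitHub back-off step can be sketched as capped exponential back-off with full jitter. This is a generic illustration, not Railway's retry code: `RateLimited`, `call_with_backoff`, and the default base and cap values are assumptions, and `retry_after_s` stands in for whatever wait hint the remote API returns.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error raised by a client when the remote side throttles."""

    def __init__(self, retry_after_s: Optional[float] = None) -> None:
        super().__init__("rate limited")
        self.retry_after_s = retry_after_s


def backoff_delay(attempt: int, base_s: float = 1.0, cap_s: float = 300.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full-jitter delay: rng() scaled by min(cap_s, base_s * 2**attempt)."""
    return rng() * min(cap_s, base_s * (2 ** attempt))


def call_with_backoff(fn: Callable[[], T], max_attempts: int = 6,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Callable[[], float] = random.random) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited as exc:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the error
            delay = backoff_delay(attempt, rng=rng)
            # Prefer the server's hint when it asks for a longer wait.
            if exc.retry_after_s is not None:
                delay = max(delay, exc.retry_after_s)
            sleep(delay)
    raise AssertionError("unreachable")
```

The jitter matters as much as the exponent. After a long outage many clients retry at once, and randomizing each delay spreads them out so they do not hit the third-party API in synchronized waves.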

Preventative measures (in progress)

Measure	Status	Expected impact
Decouple service discovery – Replicate the control‑plane API across AWS and Metal, with a consensus layer for routing data.	Design phase; prototype in staging (Q3 2026).	Removes the hard GCP dependency; edge caches can fall back to a replica on another provider without losing route data.
Cross‑cloud quorum for databases – Extend high‑availability shards to AWS and Metal so that a full‑cloud loss still preserves quorum.	Pilot on non‑critical tables (Q4 2026).	Guarantees write availability even if an entire cloud disappears.
Vendor‑agnostic data‑plane – Shift GCP services (e.g., Cloud SQL, Filestore) out of the hot path; keep them as failover only.	Architecture review completed; migration plan due early 2027.	Limits blast‑radius of any single‑provider outage.
Improved rate‑limit handling – Centralised retry manager with exponential back‑off for all third‑party integrations.	Being added to the CI/CD pipeline; release targeted for Sep 2026.	Prevents secondary throttling cascades during recovery.
Account‑suspension alerting – Subscribe to GCP’s Billing & Account notifications and add a custom health check that verifies API‑key validity every few minutes.	In testing; rollout planned Oct 2026.	Early detection of account‑level actions before they affect workloads.
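
The first measure in the table, decoupled service discovery, implies an edge read path that no longer depends on one provider. The sketch below is a deliberate simplification: it stands in for the planned consensus layer with a "highest version wins" rule, and `RouteTable`, `fetch_freshest`, and the replica names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class RouteTable:
    version: int            # assumed to increase monotonically on every write
    routes: Dict[str, str]


Replica = Tuple[str, Callable[[], RouteTable]]


def fetch_freshest(replicas: List[Replica]) -> Tuple[Optional[RouteTable], List[str]]:
    """Query every replica; return the highest-version table and the unreachable names."""
    best: Optional[RouteTable] = None
    unreachable: List[str] = []
    for name, fetch in replicas:
        try:
            table = fetch()
        except ConnectionError:
            unreachable.append(name)  # record for alerting, keep trying the rest
            continue
        if best is None or table.version > best.version:
            best = table
    return best, unreachable


def gcp_replica() -> RouteTable:
    raise ConnectionError("account suspended")


def aws_replica() -> RouteTable:
    return RouteTable(version=41, routes={"app.example.com": "aws-backend-1"})


def metal_replica() -> RouteTable:
    return RouteTable(version=42, routes={"app.example.com": "metal-backend-1"})


table, down = fetch_freshest(
    [("gcp", gcp_replica), ("aws", aws_replica), ("metal", metal_replica)]
)
```

Here `table` is the version 42 table from the Metal replica and `down` is `["gcp"]`. A real design would also need the write path to agree on versions across providers, which is the part the consensus layer is meant to solve.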

Takeaways for practitioners

Treat provider‑level account status as a first‑class failure mode. Most HA designs assume the cloud will keep the account active; an automated suspension breaks that assumption.
Separate discovery from data plane. Caching is useful, but the source of truth must be replicated across zones and clouds, otherwise cache expiry becomes a single point of failure.
Design for graceful degradation, not just failover. When a control plane disappears, edge nodes should continue serving stale routes for a bounded period rather than dropping traffic entirely.
Build retry hygiene into every external integration. A sudden surge of retries can trigger rate limits on services you do not control, amplifying the outage.
Post-incident verification matters. Railway’s engineers are still awaiting confirmation from Google on whether the networking delay was provider-side, so independent verification (e.g., synthetic probes) should be part of the closure criteria.
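
The graceful-degradation takeaway, serving stale routes for a bounded period, can be sketched as a cache with two thresholds: a TTL that triggers a refresh and a separate staleness limit that applies only when the refresh fails. The class name, the `serving_stale` flag, and the six-hour limit are illustrative assumptions, not values from Railway.

```python
import time
from typing import Callable, Dict, Optional


class StaleTolerantRoutes:
    """Hypothetical cache that keeps the last good table when refresh fails."""

    def __init__(self, fetch: Callable[[], Dict[str, str]], ttl_s: float,
                 max_stale_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl_s = ttl_s
        self._max_stale_s = max_stale_s
        self._clock = clock
        self._routes: Dict[str, str] = {}
        self._fetched_at: Optional[float] = None
        self.serving_stale = False  # worth exporting as a metric or alert

    def lookup(self, host: str) -> Optional[str]:
        now = self._clock()
        age = None if self._fetched_at is None else now - self._fetched_at
        if age is None or age >= self._ttl_s:
            try:
                self._routes = self._fetch()
                self._fetched_at = now
                self.serving_stale = False
            except ConnectionError:
                if age is None or age >= self._max_stale_s:
                    # Nothing trustworthy left: fail explicitly.
                    return None
                self.serving_stale = True  # keep the last good table
        return self._routes.get(host)


def demo() -> list:
    now = [0.0]
    up = [True]

    def fetch() -> Dict[str, str]:
        if not up[0]:
            raise ConnectionError("control plane unreachable")
        return {"app.example.com": "aws-backend-1"}

    routes = StaleTolerantRoutes(fetch, ttl_s=1800, max_stale_s=6 * 3600,
                                 clock=lambda: now[0])
    out = [routes.lookup("app.example.com")]      # fresh
    up[0] = False
    now[0] = 1800
    out.append(routes.lookup("app.example.com"))  # TTL passed, stale but served
    now[0] = 6 * 3600
    out.append(routes.lookup("app.example.com"))  # staleness limit reached
    return out
```

The limit is a trade-off. Stale routes can point at backends that have since moved, so the window should be long enough to ride out a control-plane outage but short enough that misrouting stays unlikely.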

Resources

Official GCP status page: https://status.cloud.google.com
Railway’s public incident tracker (archived): https://railway.app/incidents/2026-05-19
Best practices for multi‑cloud service discovery: https://cloud.google.com/architecture/multi-cloud-service-discovery
GitHub rate‑limit documentation: https://docs.github.com/en/rest/overview/resources-in-the-rest-api#rate-limiting

Railway’s engineering team acknowledges full responsibility for the architectural choices that allowed a single upstream action to cascade platform‑wide. The outlined mitigations aim to make the control plane truly multi‑cloud and to ensure that future provider‑level incidents remain isolated to the affected cloud only.

#Cloud #Multi-Cloud #incident #Service Discovery #rate-limiting
