When DNS Becomes the Silent Killer: PayTrack’s Backend Outage
#Infrastructure

Backend Reporter

PayTrack’s Flask backend lost connectivity to Supabase not because of code bugs but due to ISP‑level DNS routing problems. Switching to public resolvers restored service, highlighting how infrastructure layers outside the application can become the real source of failure.

During a routine sprint on PayTrack, the team hit a wall that looked like a classic Flask‑Supabase integration bug. All request logs showed timeouts, CORS headers were correct, credentials were valid, and local unit tests passed. Yet the production backend refused to talk to Supabase.


The Problem: A Black‑Box Connectivity Failure

| Symptom | Initial hypothesis |
| --- | --- |
| requests.exceptions.ConnectTimeout from Flask → Supabase | Bad Flask config or missing env vars |
| 200 OK from the health endpoint, but 504 from API calls | Network firewall or VPC rule |
| No error in Supabase logs | Issue on the client side |

The team cycled through the usual suspects:

  • Verified Flask’s SESSION_COOKIE_SAMESITE and CORS headers.
  • Re‑generated Supabase service role keys and re‑added them to the environment.
  • Checked outbound security groups and NAT gateway rules.
  • Ran curl from the container shell; it hung forever.

Every configuration check came back clean, yet the raw curl still hung. The code path that performed the HTTP request was identical to the one that worked in staging. The mystery deepened.


The Real Culprit: ISP DNS Routing

A quick dig supabase.co from the production host returned an IP address that did not belong to Supabase’s published range. Tracing the query showed the ISP’s recursive resolver was returning stale or hijacked records. When the team switched the server’s /etc/resolv.conf to use Google’s public DNS (8.8.8.8) or Cloudflare’s (1.1.1.1), the dig output matched the official Supabase endpoints and the Flask app immediately re‑established connectivity.
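To make that comparison repeatable, here is a minimal Python sketch of the same check, assuming dnspython is installed; supabase.co stands in for the project's actual API hostname, and a mismatch is a prompt to investigate rather than proof of hijacking, since CDNs and geo‑routing can legitimately return different answers.

```python
# Compare what the system resolver returns with what a public resolver returns.
# Requires dnspython (pip install dnspython); the hostname is illustrative.
import socket
import dns.resolver

HOSTNAME = "supabase.co"  # stand-in for the real Supabase API hostname

# 1. Answers from the host's default resolver (the ISP's, in PayTrack's case)
system_ips = {info[4][0] for info in socket.getaddrinfo(HOSTNAME, 443, family=socket.AF_INET)}

# 2. Answers from Cloudflare's public resolver
public = dns.resolver.Resolver(configure=False)
public.nameservers = ["1.1.1.1"]
public_ips = {rr.address for rr in public.resolve(HOSTNAME, "A")}

print("system resolver:", sorted(system_ips))
print("public resolver:", sorted(public_ips))
if not system_ips & public_ips:
    print("WARNING: resolvers disagree – possible stale or hijacked records")
```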

Key takeaway: DNS resolution happens before any HTTP request is made. If the resolver hands back the wrong IP, the application never gets a chance to retry or log a meaningful error.


Solution Approach

  1. Validate DNS at startup – Add a health‑check that resolves critical third‑party domains and compares the results against an allow‑list. If the check fails, the service can abort early and alert ops (a minimal sketch follows this list).
  2. Pin to reliable resolvers – Configure the container runtime or VM to use known public DNS servers instead of the default ISP resolver. This can be done via Docker’s --dns flag or by setting resolvconf in the OS image.
  3. Cache with fallback – Use a resolver library such as dnspython, which can cache answers and retry against an alternative resolver on failure, reducing the impact of a single bad DNS hop (see the second sketch below).
  4. Monitor DNS latency – Export resolution‑time metrics to Prometheus (for example via the blackbox_exporter's DNS probe or an in‑app timer) and alert when lookup times spike.
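A minimal sketch of the startup check from step 1, assuming the team maintains the allow‑list themselves; the hostname and network range below are placeholders, not Supabase's published addresses.

```python
# Startup DNS validation sketch for a Flask service.
# The hostname and allowed ranges are placeholders, not PayTrack's real values.
import ipaddress
import socket
import sys

CRITICAL_HOSTS = {
    # hostname -> networks its A records are expected to fall inside
    "supabase.co": [ipaddress.ip_network("203.0.113.0/24")],  # replace with published ranges
}

def check_dns() -> bool:
    ok = True
    for host, allowed in CRITICAL_HOSTS.items():
        try:
            infos = socket.getaddrinfo(host, 443, family=socket.AF_INET)
        except socket.gaierror as exc:
            print(f"DNS check failed for {host}: {exc}")
            ok = False
            continue
        for info in infos:
            ip = ipaddress.ip_address(info[4][0])
            if not any(ip in net for net in allowed):
                print(f"{host} resolved to unexpected address {ip}")
                ok = False
    return ok

if __name__ == "__main__":
    if not check_dns():
        sys.exit(1)  # abort early so the orchestrator can restart and alert ops
```

Running this before the Flask app starts serving (or as a readiness probe) lets the service fail fast instead of timing out on its first outbound call.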

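And a sketch of the cache‑with‑fallback idea from step 3, again assuming dnspython; the nameserver addresses are examples, and wiring the returned IPs into an HTTP client would take extra work (custom adapters or connection pools).

```python
# Resolve-with-fallback helper using dnspython's Resolver and LRU cache.
# Nameserver addresses are examples; in production, read them from config.
import dns.exception
import dns.resolver

def make_resolver(nameservers=None) -> dns.resolver.Resolver:
    r = dns.resolver.Resolver(configure=nameservers is None)  # system config unless overridden
    if nameservers:
        r.nameservers = list(nameservers)
    r.cache = dns.resolver.LRUCache()  # reuse answers between calls
    r.lifetime = 2.0                   # fail fast so the fallback gets a chance
    return r

PRIMARY = make_resolver()                          # whatever the host is configured with
FALLBACK = make_resolver(["1.1.1.1", "8.8.8.8"])   # public resolvers

def resolve_with_fallback(hostname: str) -> list[str]:
    for resolver in (PRIMARY, FALLBACK):
        try:
            return [rr.address for rr in resolver.resolve(hostname, "A")]
        except dns.exception.DNSException:
            continue
    raise RuntimeError(f"could not resolve {hostname} with any resolver")
```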
Trade‑offs and Considerations

| Aspect | Benefit | Cost |
| --- | --- | --- |
| Hard‑coding public DNS | Immediate mitigation of ISP‑level issues; predictable resolution path. | Loss of locality – queries may travel farther, adding a few milliseconds of latency. |
| Application‑level DNS validation | Early detection of mis‑routing; can trigger automated rollbacks. | Extra startup time; requires maintenance of an allow‑list that can become stale. |
| External DNS monitoring | Provides visibility into global resolver health; can surface ISP outages before they affect users. | Additional infrastructure overhead; alerts can become noisy if not tuned. |

In PayTrack’s case, the simplest win was to point the production hosts at Cloudflare’s DNS. The latency impact was negligible (<2 ms on average) compared to the minutes of downtime saved.


Broader Implications for Distributed Systems

PayTrack’s incident is a reminder that networking layers are first‑class citizens in any distributed architecture. While we often focus on consistency models, sharding strategies, or API versioning, the reliability of name resolution sits at the foundation of every remote call.

  • Consistency vs. Availability: DNS failures manifest as availability problems even when the underlying service is perfectly consistent. Designing fallback resolvers improves availability without touching data consistency.
  • Observability: Traditional request‑level tracing (e.g., OpenTelemetry) will not capture a DNS failure because the request never leaves the client. Adding DNS metrics fills that blind spot.
  • Operational hygiene: Treat DNS configuration like any other critical dependency. Store resolver settings in version‑controlled infrastructure code (Terraform, Ansible) and test them in CI pipelines.

Action Items for Teams

  1. Audit your production DNS settings – Ensure they point to reliable resolvers and are not inherited from a flaky ISP.
  2. Add DNS health checks to your CI/CD pipelines; a failing dig should break the build.
  3. Instrument DNS latency in your monitoring stack and set alerts for outliers (a short sketch follows this list).
  4. Document fallback procedures so on‑call engineers can switch resolvers quickly when an ISP outage is suspected.
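For the latency item, one lightweight option is an in‑process probe that exports resolution times as a Prometheus histogram. The sketch below assumes the prometheus_client package; the metric name, port, and interval are arbitrary choices, not PayTrack's actual setup.

```python
# Export DNS resolution latency for critical hostnames as a Prometheus histogram.
# Requires prometheus_client (pip install prometheus-client); names and ports are illustrative.
import socket
import time

from prometheus_client import Histogram, start_http_server

DNS_RESOLUTION_SECONDS = Histogram(
    "dns_resolution_seconds",
    "Time spent resolving critical third-party hostnames",
    ["hostname"],
)

def probe(hostname: str) -> None:
    start = time.monotonic()
    try:
        socket.getaddrinfo(hostname, 443)
    finally:
        # Record the duration even when resolution fails, so timeouts show up too.
        DNS_RESOLUTION_SECONDS.labels(hostname=hostname).observe(time.monotonic() - start)

if __name__ == "__main__":
    start_http_server(9109)  # scrape target for Prometheus; the port is arbitrary
    while True:
        probe("supabase.co")
        time.sleep(30)
```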

Conclusion

The PayTrack outage underscores a simple truth: software rarely fails in isolation. The stack that delivers a request includes the OS, the network stack, and the DNS infrastructure. By extending our debugging mindset beyond the application code and treating DNS as a critical dependency, we can avoid silent outages that waste hours of investigation.



Figure: A typical request flow – the DNS lookup sits at the very start, before any HTTP traffic leaves the host.
