Latency Is Not a Performance Problem. It Is a Design Problem
#Infrastructure

Backend Reporter
4 min read

The article argues that latency in distributed systems stems from architectural design choices rather than performance optimization opportunities. It explains how microservices expose inherent latency, why asynchronous patterns don't eliminate waiting, and how fan-out patterns amplify tail latency. The piece emphasizes that humans perceive latency differently than machines measure it, and that the fastest systems eliminate work from the request path rather than optimizing it.

For more than twenty years the industry has treated latency as something to optimize. Faster CPUs. Better frameworks. Smarter garbage collectors. More aggressive caching. Parallel execution. Async everywhere. Yet systems still feel slow. This happens because latency is not primarily a performance problem. It is a design problem that performance tools cannot fix.

Performance assumes the system shape is correct and execution is inefficient. Latency appears when the system shape itself forces waiting. Once waiting exists in the design, no amount of optimization removes it. You can only hide it.

Distributed systems do not fail because components are slow. They fail because time is forced to flow through too many places.

Why latency compounds across services

In a monolithic system a function call is cheap. Memory is shared. Execution remains inside a single scheduler. Time mostly behaves.

In distributed systems every boundary introduces waiting. A service call is not a method call. It is a negotiation between machines. Before a single line of business logic executes, the system must perform:

  • Thread scheduling and kernel transitions
  • Serialization and deserialization
  • Network buffering and TCP flow control
  • Routing and remote queuing

None of this is free. When several services are chained, latency compounds, not additively but statistically. Percentiles do not compose. If five services each have a p95 of 40ms, you cannot budget the chain as a predictable 200ms: the probability that at least one hop lands in its slow tail is roughly 1 - 0.95^5, about 23%, so nearly a quarter of requests inherit someone's worst case. The slowest tail dominates. One cold cache or one GC pause decides the outcome for the whole chain.

This is why systems appear fast in metrics but feel slow in reality.
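This is easy to see with a rough simulation. The sketch below is illustrative rather than a benchmark: it assumes each service's latency follows a lognormal shape tuned so a single call's p95 lands near 40ms, then chains five calls per request.

```python
import random

def percentile(values, p):
    """Nearest-rank percentile of an unsorted list of samples."""
    return sorted(values)[int(p / 100 * (len(values) - 1))]

def call_latency_ms():
    # Assumed shape: lognormal, median ~20 ms, tuned so a single call's
    # p95 lands near 40 ms. Real services have their own distributions.
    return random.lognormvariate(3.0, 0.45)

random.seed(7)
N = 100_000

single = [call_latency_ms() for _ in range(N)]
p95_single = percentile(single, 95)

# Each request calls five services in sequence; track the chain total and
# whether any hop landed beyond its own p95.
chains = [[call_latency_ms() for _ in range(5)] for _ in range(N)]
totals = [sum(chain) for chain in chains]
slow_hop = sum(1 for chain in chains if max(chain) > p95_single) / N

print(f"per-service p95:        {p95_single:6.1f} ms")
print(f"chain p50:              {percentile(totals, 50):6.1f} ms")
print(f"chain p95:              {percentile(totals, 95):6.1f} ms")
print(f"chains with a slow hop: {slow_hop:.0%}")  # ~ 1 - 0.95^5, about 23%
```

Under these assumptions the per-service numbers look comfortable, yet nearly a quarter of chained requests contain at least one hop beyond its p95, and the chain's tail sits far above any single service's.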

Microservices did not create latency. They exposed it. Latency existed long before microservices; monoliths simply hid it inside in-process calls. Microservices externalized it. Every design shortcut that once lived quietly inside a codebase became observable across the network. Tight coupling turned into synchronous dependencies. Over-normalized logic turned into chatty communication. Microservices did not make systems slower. They made architectural mistakes measurable.

Why async does not fix latency

Async improves throughput. It does not reduce time. If a remote call takes 120ms, making it asynchronous does not make it faster. It only allows the thread to do something else while waiting. The wall clock still moves at the same speed.

Async changes who waits. It does not remove waiting. Most systems still contain a synchronization point where all required data must be available before responding. That moment defines perceived latency. Everything before it is irrelevant.
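A minimal asyncio sketch of that synchronization point (the service names and the 120ms delay are made up): the event loop is free to serve other work while the calls are in flight, but this request still cannot respond before the slowest awaited call returns.

```python
import asyncio
import time

async def remote_call(name: str, ms: int) -> str:
    # Stand-in for a network call; the sleep models latency we cannot remove.
    await asyncio.sleep(ms / 1000)
    return f"{name}: ok"

async def handle_request() -> list[str]:
    started = time.perf_counter()
    # While these calls are in flight the event loop can serve other requests,
    # which is a throughput win...
    calls = [remote_call("profile", 120), remote_call("orders", 120)]
    # ...but this response cannot be assembled before the synchronization point.
    results = await asyncio.gather(*calls)
    elapsed_ms = (time.perf_counter() - started) * 1000
    print(f"wall clock for this request: {elapsed_ms:.0f} ms")  # still ~120 ms
    return results

asyncio.run(handle_request())
```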

Fan-out: The most dangerous latency pattern

Fan-out is seductive. One request becomes many parallel calls with aggregation at the end. On diagrams, this looks scalable. In reality, response time is set by the slowest downstream dependency. At scale, something is always slow: network jitter, a hot shard, a briefly exhausted thread pool.

When a request fans out to ten services, you are designing for tail latency amplification. This is why mature systems aggressively collapse reads, precompute views, and accept duplication. They do it because latency is more expensive than redundancy.
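The amplification is easy to reproduce. The sketch below uses the same assumed lognormal latency shape as earlier and treats the aggregated response as the max of ten parallel calls; with that shape, the fan-out's median is already in the neighborhood of a single call's p95.

```python
import random

def percentile(values, p):
    return sorted(values)[int(p / 100 * (len(values) - 1))]

def call_latency_ms():
    # Same assumed lognormal shape as before: median ~20 ms, p95 near 40 ms.
    return random.lognormvariate(3.0, 0.45)

random.seed(7)
N = 100_000

single  = [call_latency_ms() for _ in range(N)]
fan_out = [max(call_latency_ms() for _ in range(10)) for _ in range(N)]

for label, samples in (("single call", single), ("fan-out of 10", fan_out)):
    print(f"{label:13}  p50 {percentile(samples, 50):5.1f} ms"
          f"  p95 {percentile(samples, 95):5.1f} ms"
          f"  p99 {percentile(samples, 99):5.1f} ms")
```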

Humans perceive latency before machines measure it

Machines measure latency numerically. Humans experience it neurologically:

  • 100-150ms: Interaction stops feeling instantaneous
  • 300ms: Delay becomes noticeable
  • 1 second: Cognitive flow breaks
  • 2-3 seconds: Trust begins to erode

A backend team may celebrate a 250ms p95, but users already feel friction. Latency is not just time—it is a loss of agency.

The request path is sacred

Latency rarely comes from language choice. It comes from synchronous dependency chains and computing truth during the request instead of before it. Fast systems move work out of the request path. They compute earlier. They cache aggressively. They pre-join data. They accept eventual consistency deliberately.

The oldest rule still holds: You can be correct or you can be fast. If a system must compute truth at the moment a human is waiting, the system is already late. The heavy thinking should have already happened. The request should be retrieval, not discovery.
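Here is a toy sketch of that shape; every name in it (rebuild_view, DASHBOARD_VIEW, the dictionaries standing in for other services) is invented for illustration. The point is only where the work happens: aggregation before the request, a single read during it.

```python
# Toy sketch: truth is computed before the request, not during it.
# ORDERS, PROFILE, rebuild_view and DASHBOARD_VIEW are invented for illustration.

ORDERS = {"user-42": [{"total": 30.0}, {"total": 12.5}]}
PROFILE = {"user-42": {"name": "Ada"}}

# Precomputed, pre-joined view, refreshed out of band (on write, by a stream
# consumer, or on a schedule). Accepting that it can be slightly stale is the
# deliberate eventual-consistency trade described above.
DASHBOARD_VIEW: dict[str, dict] = {}

def rebuild_view(user_id: str) -> None:
    """The heavy thinking: joins and aggregation happen while nobody waits."""
    profile = PROFILE[user_id]
    orders = ORDERS[user_id]
    DASHBOARD_VIEW[user_id] = {
        "name": profile["name"],
        "order_count": len(orders),
        "lifetime_value": sum(o["total"] for o in orders),
    }

def handle_dashboard_request(user_id: str) -> dict:
    """The request path: retrieval, not discovery. One lookup, no fan-out."""
    return DASHBOARD_VIEW[user_id]

rebuild_view("user-42")                      # done on write, or "yesterday"
print(handle_dashboard_request("user-42"))   # {'name': 'Ada', 'order_count': 2, ...}
```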

Conclusion: Design Short, Not Fast

Junior thinking asks how to make this faster. Senior thinking asks why this request exists at all. Latency drops most when work is removed, not accelerated. The fastest network call is the one never made. The fastest query is the one computed yesterday.

Latency is not a bug. It is feedback. It tells you where your system is thinking too late. Once you see that difference, you stop fighting latency. You design around it.
