Why running engineering teams at 100% capacity makes them slower, not faster
#Business

Why running engineering teams at 100% capacity makes them slower, not faster

Backend Reporter
7 min read

Charity Majors launched her Stack Overflow advice column with a question every senior engineer has stewed over: how do you convince leadership that slack isn't slacking off? Her answer reframes the whole fight, and the systems analogy at its core is one any distributed systems engineer will recognize immediately.

Charity Majors, co-founder and CTO of Honeycomb, kicked off a new advice column on the Stack Overflow blog with a question that lands squarely on one of the oldest unsolved problems in software organizations: how does an individual contributor convince leadership to stop pushing teams past their breaking point? The letter writer, a senior staff engineer signing off as S.L.A.C.K.E.R., described teams that have felt unsustainably loaded for years and asked how to argue for building slack into the system without it sounding like an argument for doing less.

The response is worth reading not because the advice is novel in the abstract, but because Majors reframes the entire problem in terms that anyone who has operated a production system will recognize. The core insight is one that queueing theory has formalized for decades, and watching it applied to human teams is clarifying.

Featured image

The problem: speed is the only word anyone agrees on

Majors makes a sharp observation early on. When engineering organizations talk about their work, the vocabulary collapses down to two words: fast and slow. Everyone agrees slow is bad and fast is good, which sounds like consensus but is actually the opposite of a useful conversation. The terms flatten a multidimensional systems problem into a one-dimensional moral judgment, and once you have reduced something to a moral judgment, the discussion is over.

This is a failure mode familiar from operating distributed systems. "The service is slow" is not an actionable signal. It is the human equivalent of a single latency number with no percentile breakdown, no trace, and no indication of where in the call graph the time is being spent. You cannot fix what you cannot describe, and you cannot describe a complex system with two adjectives.

Her advice to the letter writer is to stop saying "we need to slow down" or "this pace is unsustainable." The moment you say it, you become the person arguing for doing less, and you lose. Everyone in the room, the engineers included, wants to ship faster. The disagreement is about what faster actually means and how you get there.

The solution: build a richer vocabulary for the work

Majors proposes borrowing terms with more specificity. She likes "clock rate," how long it takes to ship a change and how much friction is in the path, and "saturation," how much of your capacity a given set of goals consumes in the allotted time. Both are deliberately mechanical, and that is the point. They invite measurement and tradeoff discussions instead of moral ones.

She also points to Jack Danger's framing of technical debt as financing, which decomposes debt into principal, interest rate (the energy lost just tolerating it), growth in principal, growth in interest, and payoff events (future inflection points that force payment in full). The value of this model is that it turns "we have tech debt" into a set of quantities you can actually reason about and prioritize against. For teams that want to go further, she gestures at the usual instrumentation: DORA metrics, the SPACE framework, and commercial tools like Swarmia and DX. The specific framework matters less than the move from adjectives to measurable dimensions.

The trade-off: saturation versus throughput

The heart of the column is an analogy that any engineer who has watched a system fall over will feel in their gut. A team loaded to 100% capacity does not go fast. It grinds to a halt, the way New York traffic does at rush hour.

This is not a metaphor that breaks down under scrutiny. It is essentially the same dynamic that governs any queueing system. As utilization approaches 100%, queue length and wait time do not grow linearly, they grow asymptotically. A road at 80% capacity can still move cars at the speed limit. Push it past a threshold and any minor perturbation, a brake tap, a merge, a small delay, cascades into gridlock because there is no headroom to absorb it. Latency under load is not a smooth curve, it is a cliff.

The same physics applies to a team. A team running at full saturation has no buffer for the unplanned work that reality reliably generates: the incident, the urgent customer escalation, the dependency that breaks. Without flex in the system, every one of those events forces something else to slip, and the slippage compounds. The organization experiences this as missed dates and constant firefighting, then often responds by applying more pressure, which removes whatever slack remained, which makes the next perturbation worse. It is a positive feedback loop pointed in the wrong direction.

Majors is blunt about what running teams at their melting point looks like in human terms: people working late, skipping exercise, snapping at their families, drinking to fall asleep, crying on Sunday nights. Anyone can sprint into the red occasionally, especially when the work is intrinsically motivating, but no one sustains it indefinitely. And the operational cost is concrete: people cannot step up in a genuine crisis if they have already spent everything keeping up with the everyday. Treating normal engineering as a perpetual emergency guarantees you have no surge capacity when a real emergency arrives. Reserve capacity is not waste. It is the thing that lets the system stay responsive under variance.

Urgency is a feeling, not a plan

Majors separates the levers that actually move delivery speed from the one leaders most often reach for. Throughput is a function of many things: how hard the work is, how much friction surrounds it, how familiar it is, how distracted people are, their intrinsic motivation. "Urgency" addresses none of these. It is emotional pressure, and leaders apply it because it feels like the only lever within reach.

Working longer hours and ratcheting up stress are short-term patches. The durable fixes are structural, and most of them live in platform engineering and developer enablement: reducing the cost of doing the work itself, lowering cycle time, removing the friction that makes every change expensive. These investments take years to pay off and require sustained discipline to land and keep. That timeline is precisely why pressure feels more attractive to a stressed leader, and precisely why it does not work.

The lever that actually works: evidence

The most practical piece of advice in the column is also the most tactical. The letter writer had mentioned that when the inflow of work slows even slightly, their team's velocity goes up and they start doing more interesting, higher-leverage work. Majors seizes on this. That is not a complaint, it is evidence, and it points in exactly the direction leadership claims to want.

Managers are numb to the standard requests. "We need more resources" and "we don't have enough time" arrive constantly, in good times and bad, so they carry almost no signal. (Engineers do the same thing, tuning out "this is an inflection point" and "we feel urgency" once they have heard them enough.) Anything you grow to expect, you stop hearing. But someone arriving with a solution that demonstrably works, backed by observed results from their own team, cuts through the noise. The argument for slack is far stronger when framed as "here is the throughput we measured when the system had headroom" than as "we are tired."

You will not solve this

Majors closes with a dose of realism that keeps the piece from being a tidy manifesto. Right-sizing work and deciding what to build is genuinely one of the hardest problems in the field, and most engineers do not appreciate how hard until they have done it from the management side. She is clear that this is a job for staff-plus engineers as much as for managers, and that experienced engineers are often better positioned to spot and fix certain classes of problems than even technically strong managers. Power tends to drift toward management over time, and she thinks ICs could stand to reclaim some of it.

But there will always be something broken, flaky, or slow. The goal is not a perfectly tuned system, which does not exist, but a system with enough flex to stay fast over the long run rather than fast right up until it seizes. The full column is available on the Stack Overflow blog, and Majors is taking reader questions on observability, careers, leadership, and team dynamics for the limited run.

For anyone who has spent a career watching saturated systems fail in slow motion and then all at once, the lesson transfers cleanly. Headroom is not slack in the pejorative sense. It is the margin that absorbs variance, and a system without it is not running efficiently. It is one perturbation away from gridlock.

Comments

Loading comments...