What a Go-to-Rust Webhook Rewrite Really Changes

A webhook platform rewrite from Go to Rust is less about language taste and more about tightening latency, memory behavior, and failure handling in a system where retries, ordering, and partial success decide whether customers trust the API.

Problem

Webhook delivery looks simple until volume turns every edge case into production behavior. An event arrives, the platform signs a payload, sends an HTTP request to a customer endpoint, records the result, and retries failures. That description hides the actual distributed systems problem: the sender does not control the receiver, the network is unreliable, success is often ambiguous, and retry logic can either protect the system or multiply the incident.

The reported rewrite started from a Go service that handled roughly 50,000 webhooks per day comfortably, then began showing pressure around 200,000 daily deliveries. Idle memory sat near 2.8 GB, peak memory climbed above 6 GB during spikes, and p99 latency reached 340 ms. For many internal APIs, that would be annoying. For a webhook delivery platform, it changes the shape of the system.

Webhook infrastructure is usually fan-out heavy. One upstream event can produce many outbound requests. Those requests can be slow, fail intermittently, return misleading status codes, or time out after the receiver has already processed the event. If the delivery service holds too much memory per in-flight request, a receiver slowdown can become a sender outage. If latency tails widen, retry queues grow. If retry queues grow without backpressure, the system starts competing with itself.
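
As a minimal sketch of what backpressure can look like at the queue boundary, the following uses a bounded standard-library channel so that a full queue rejects new work instead of growing. DeliveryJob, Enqueue, bounded_queue, and try_enqueue are hypothetical names for illustration, and the capacity is an assumed tuning knob, not something taken from the original system.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical unit of work: one outbound delivery attempt.
#[derive(Debug, PartialEq)]
pub struct DeliveryJob {
    pub event_id: u64,
}

/// Result of trying to enqueue a job without blocking.
#[derive(Debug, PartialEq)]
pub enum Enqueue {
    Accepted,
    /// The queue is full or closed: the caller gets the job back and must
    /// defer it, shed it, or slow the producer down.
    Rejected(DeliveryJob),
}

/// Create a bounded queue. `capacity` is an assumed tuning knob.
pub fn bounded_queue(capacity: usize) -> (SyncSender<DeliveryJob>, Receiver<DeliveryJob>) {
    sync_channel(capacity)
}

/// Non-blocking enqueue: a full queue pushes back instead of growing.
pub fn try_enqueue(tx: &SyncSender<DeliveryJob>, job: DeliveryJob) -> Enqueue {
    match tx.try_send(job) {
        Ok(()) => Enqueue::Accepted,
        Err(TrySendError::Full(job)) | Err(TrySendError::Disconnected(job)) => {
            Enqueue::Rejected(job)
        }
    }
}
```

The point of the design is that memory use is bounded by the queue capacity, so a slow receiver shows up as rejected enqueues the producer can react to, not as unbounded growth.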

Go is often a good fit for this kind of service. Goroutines, channels, the standard HTTP stack, and simple deployment make it easy to build concurrent backends. The problem is not that Go cannot run webhook systems. Plenty of teams do that successfully. The issue is that Go makes concurrency inexpensive to start, while correctness still depends on careful ownership of state, cancellation, timeout handling, and result recording.

The most concerning failure in the original account was not memory usage. It was a race where deliveries could be marked successful before the response body was fully read. That kind of bug is familiar to anyone who has operated delivery systems. It creates false confidence. The database says the message is done, while the real external interaction is still incomplete. Once that state is committed, repair becomes hard because the system has already thrown away the evidence it needs to make a better decision.

That is a consistency problem, not just a code quality problem. The delivery system needs a clear model for when an attempt becomes durable truth. Is success defined as receiving a 2xx status? Receiving and draining the full response body? Persisting the result after the response lifecycle completes? Updating metrics after the state transition? These details decide whether the platform is at-least-once, effectively-once for idempotent consumers, or simply optimistic under load.

Solution Approach

The team moved the core delivery path to Rust first, then the API layer, then migrated traffic gradually. That sequencing matters. Rewrites often fail when they replace too much surface area before proving the riskiest assumption. In this case, the riskiest part was not request routing or CRUD endpoints. It was the hot path: HTTP delivery, HMAC signing, retry classification, timeout behavior, and durable result recording.

Rust changes the engineering constraints in a useful way. Its ownership model removes garbage collection from the runtime profile, which can help with latency predictability under high concurrency. Its type system also forces more explicit state modeling. A webhook attempt can be represented as success, retryable failure, permanent failure, timeout, signature error, endpoint configuration error, or internal error. Those states should not be loose strings passed across the system. They should be types that make illegal transitions difficult.

For example, a delivery function returning a structured result like Result<DeliveryResult, DeliveryError> forces the caller to distinguish transport errors from application-level HTTP results. A 429 Too Many Requests is not the same thing as a malformed URL. A 500 from a receiver should usually enter a retry path. A 401 probably means configuration is wrong and repeated retries will waste capacity. A timeout is ambiguous because the receiver might have processed the request before the sender gave up.
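
A minimal sketch of that distinction, assuming a simplified shape where the transport layer returns either an HTTP status code or a transport error. Outcome, TransportError, classify, and should_retry are illustrative names, not the original service's API, and the exact status-code policy (for example, treating 3xx and most 4xx as permanent) is an assumption a real platform would tune.

```rust
/// Classified outcome of one delivery attempt (illustrative names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Success,
    RetryableFailure,
    PermanentFailure,
    /// Ambiguous: the receiver may have processed the request already.
    Timeout,
}

/// Failures that happen before any HTTP status is available.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TransportError {
    TimedOut,
    ConnectionFailed,
    MalformedUrl,
}

/// Map a transport-level result onto a typed outcome.
/// The status-code policy here is an assumption, not a standard.
pub fn classify(result: Result<u16, TransportError>) -> Outcome {
    match result {
        Ok(200..=299) => Outcome::Success,
        // Throttling and request timeouts are worth retrying later.
        Ok(408 | 429) => Outcome::RetryableFailure,
        // Receiver-side errors usually enter the retry path.
        Ok(500..=599) => Outcome::RetryableFailure,
        // Everything else (401, 404, unexpected 3xx, ...) is treated as a
        // configuration problem that retries will not fix.
        Ok(_) => Outcome::PermanentFailure,
        Err(TransportError::TimedOut) => Outcome::Timeout,
        Err(TransportError::ConnectionFailed) => Outcome::RetryableFailure,
        Err(TransportError::MalformedUrl) => Outcome::PermanentFailure,
    }
}

/// Timeouts are retried too, which is why receivers must deduplicate.
pub fn should_retry(outcome: Outcome) -> bool {
    matches!(outcome, Outcome::RetryableFailure | Outcome::Timeout)
}
```

Keeping Timeout as its own variant, instead of folding it into RetryableFailure, preserves the ambiguity for later stages such as metrics and customer-facing delivery logs.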

That ambiguity is where API design and distributed systems meet. A serious webhook platform should encourage idempotency on the receiving side and provide stable event identifiers in headers. A common pattern is to send headers such as X-Event-Id, X-Delivery-Id, X-Signature, and X-Timestamp, then document that receivers should deduplicate by event or delivery identifier. Stripe's webhook documentation follows a similar approach, with timestamped signatures and guidance to tolerate duplicate events, and the pattern exists because exactly-once delivery across HTTP boundaries is not a promise a sender can honestly make.

The more realistic contract is at-least-once delivery with bounded retries and clear observability. That means the sender may deliver the same event more than once, especially around timeouts, connection resets, deploys, and queue replays. The API should make that safe. A receiver should be able to treat repeated deliveries as normal, not exceptional.
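
On the receiving side, treating repeats as normal can be as small as the sketch below. It assumes the receiver keys on the value of a header like X-Event-Id and keeps seen identifiers in memory. Deduplicator is a hypothetical name, and a real receiver would more likely use a durable store with expiry than a HashSet.

```rust
use std::collections::HashSet;

/// In-memory deduplication keyed on a stable event identifier.
/// Assumption: a production receiver would persist this set and expire
/// old entries; a HashSet only illustrates the contract.
pub struct Deduplicator {
    seen: HashSet<String>,
}

impl Deduplicator {
    pub fn new() -> Self {
        Self { seen: HashSet::new() }
    }

    /// Returns true the first time an event id is seen and false for
    /// every repeat, so the handler can skip work it has already done.
    pub fn first_delivery(&mut self, event_id: &str) -> bool {
        self.seen.insert(event_id.to_owned())
    }
}
```

A handler would call first_delivery with the event identifier and still return a 2xx response for repeats, so the sender stops retrying.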

Rust does not solve that contract by itself. It does make it easier to encode the local rules cleanly. A retry classifier can be a function over typed outcomes. A delivery state machine can prevent a transition from Pending directly to Succeeded unless an attempt result exists. A database update can be tied to a completed attempt rather than a partially borrowed response object. The compiler will not design the state machine, but it will enforce many of the boundaries once the design is expressed in types.
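
One way to express that rule, sketched under assumptions: the only method that advances a delivery takes an attempt result as an argument, so there is no code path from Pending to Succeeded without one. State, AttemptResult, TransitionError, and MAX_ATTEMPTS are illustrative names and values, not taken from the original service.

```rust
/// Result of a single completed attempt (illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AttemptResult {
    Delivered,
    Retryable,
    Permanent,
}

/// Lifecycle of one delivery.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Pending,
    RetryScheduled { attempts: u32 },
    Succeeded,
    Failed,
}

#[derive(Debug, PartialEq, Eq)]
pub enum TransitionError {
    /// Terminal states accept no further attempts.
    AlreadyTerminal(State),
}

/// Assumed retry budget; a real platform would make this configurable.
pub const MAX_ATTEMPTS: u32 = 5;

impl State {
    /// The only way to advance a delivery is to supply an attempt result.
    pub fn apply(self, result: AttemptResult) -> Result<State, TransitionError> {
        let attempts_so_far = match self {
            State::Pending => 0,
            State::RetryScheduled { attempts } => attempts,
            State::Succeeded | State::Failed => {
                return Err(TransitionError::AlreadyTerminal(self));
            }
        };
        let attempts = attempts_so_far + 1;
        Ok(match result {
            AttemptResult::Delivered => State::Succeeded,
            AttemptResult::Permanent => State::Failed,
            // Retries are bounded: exhausting the budget is a failure.
            AttemptResult::Retryable if attempts >= MAX_ATTEMPTS => State::Failed,
            AttemptResult::Retryable => State::RetryScheduled { attempts },
        })
    }
}
```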

The reported stack used Tokio for async execution and Axum for the REST API. That is a common Rust backend combination. Tokio provides the async runtime, task scheduling, timers, and I/O primitives. Axum builds on the Tower ecosystem, which is useful for middleware, timeouts, tracing, and service composition.

The migration strategy was also sensible: run the Go and Rust paths side by side, compare outputs, then shift traffic in stages. For a webhook platform, comparison should include more than status codes. It should compare signature bytes, canonical payload representation, retry decisions, timeout classification, request headers, and database state transitions. Small mismatches can become customer-visible breakage because webhook receivers often validate signatures strictly.
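
A side-by-side comparison along those lines might look like the following sketch. It assumes both implementations can render, for the same event, what they would send and what they would decide. RenderedRequest and diff are hypothetical names, and the field set is a simplification of the list above.

```rust
/// What one implementation would send and decide for a given event.
/// Hypothetical shape for illustration only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RenderedRequest {
    /// Exact payload bytes, since signatures are computed over bytes.
    pub body: Vec<u8>,
    pub signature: String,
    /// Kept as an ordered list so casing and order differences show up.
    pub headers: Vec<(String, String)>,
    pub retry_decision: String,
}

/// Return the names of the fields where the two implementations disagree.
pub fn diff(old: &RenderedRequest, new: &RenderedRequest) -> Vec<&'static str> {
    let mut mismatches = Vec::new();
    if old.body != new.body {
        mismatches.push("body");
    }
    if old.signature != new.signature {
        mismatches.push("signature");
    }
    if old.headers != new.headers {
        mismatches.push("headers");
    }
    if old.retry_decision != new.retry_decision {
        mismatches.push("retry_decision");
    }
    mismatches
}
```

Comparing bytes and ordered headers is deliberately strict: a reordered JSON field or a lowercased header name is reported as a mismatch even though many receivers would accept it.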

A staged migration also gives the team a chance to test backpressure. The hard question is not whether the Rust service is faster on a clean benchmark. The hard question is how it behaves when 10 percent of customer endpoints slow down, retry queues fill, DNS resolution stalls, or downstream storage starts timing out. Good migration plans include shadow traffic, controlled rollout, error budget checks, and a rollback path that does not corrupt delivery state.

What Changed

The reported numbers are large: idle memory dropped from 2.8 GB to 380 MB, peak memory from 6.2 GB to 1.1 GB, p99 latency from 340 ms to 38 ms, CPU from 72 percent to 31 percent, and deliveries per second from 2,400 to 8,200.
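
To put those figures on a common scale, the sketch below computes the improvement factor for each reported pair. The helper names are hypothetical, and the memory figures assume decimal units (2.8 GB taken as 2,800 MB).

```rust
/// Improvement factor for metrics where lower is better (before / after).
pub fn reduction_factor(before: f64, after: f64) -> f64 {
    before / after
}

/// Improvement factor for metrics where higher is better (after / before).
pub fn gain_factor(before: f64, after: f64) -> f64 {
    after / before
}

/// Factors for the reported before/after pairs.
/// Assumption: 1 GB is treated as 1,000 MB.
pub fn reported_factors() -> [(&'static str, f64); 5] {
    [
        ("idle memory (MB)", reduction_factor(2800.0, 380.0)),
        ("peak memory (MB)", reduction_factor(6200.0, 1100.0)),
        ("p99 latency (ms)", reduction_factor(340.0, 38.0)),
        ("CPU (percent)", reduction_factor(72.0, 31.0)),
        ("deliveries per second", gain_factor(2400.0, 8200.0)),
    ]
}
```

The factors come out to roughly 7.4x for idle memory, 5.6x for peak memory, 8.9x for p99 latency, 2.3x for CPU, and 3.4x for throughput.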

Those improvements are plausible for a rewrite that tightened allocation behavior, removed garbage collection pauses, changed HTTP client behavior, and cleaned up concurrency patterns. They should still be read as system results, not language results in isolation. A rewrite often changes batching, queue structure, payload copying, JSON handling, connection pooling, and metrics instrumentation. Rust may be the enabling factor, but the operational win usually comes from the whole design being re-examined.

The memory reduction matters because webhook delivery has bursty load. If each in-flight attempt carries too much overhead, the platform needs larger instances just to survive receiver slowness. Lower per-request memory lets the system hold more concurrent attempts before it must shed load. That can reduce cost, but it can also reduce incident frequency because the system has more headroom during retries.

The p99 improvement matters more than the p50. A p50 of 12 ms instead of 45 ms is nice. A p99 of 38 ms instead of 340 ms changes queue dynamics. Tail latency determines how long workers stay occupied. Long tails reduce effective throughput and delay retries. In systems with bounded worker pools, tail latency can look like a capacity problem even when average latency seems fine.

The consistency improvement is harder to quantify but more important. A delivery platform needs a disciplined order of operations. One reasonable flow is: reserve an attempt, build the request, compute the signature over exact bytes, send with a deadline, read enough response data to classify the result, persist the attempt outcome, then publish metrics and schedule any retry. If metrics happen before persistence, dashboards lie during database incidents. If persistence happens before response completion, the platform can mark success too early. If retries are scheduled before durable failure recording, a crash can create gaps or duplicates that are hard to explain.
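
The tail end of that flow (persist, then metrics, then retry scheduling) can be sketched as a single function where the early return on a storage error is what enforces the order. AttemptStore, SideEffects, finish_attempt, and the in-memory fakes are hypothetical names for illustration, and a real system would also need a way to recover retries from persisted state after a crash.

```rust
/// Classified result of one attempt (illustrative names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Success,
    Retryable,
    Permanent,
}

/// Durable storage for attempt outcomes.
pub trait AttemptStore {
    fn persist(&mut self, delivery_id: u64, outcome: Outcome) -> Result<(), String>;
}

/// Everything that must happen only after the outcome is durable.
pub trait SideEffects {
    fn record_metric(&mut self, outcome: Outcome);
    fn schedule_retry(&mut self, delivery_id: u64);
}

/// Persist first; publish metrics and schedule retries only on success.
pub fn finish_attempt(
    store: &mut impl AttemptStore,
    effects: &mut impl SideEffects,
    delivery_id: u64,
    outcome: Outcome,
) -> Result<(), String> {
    // 1. Durable truth. If this fails, nothing else is reported.
    store.persist(delivery_id, outcome)?;
    // 2. Metrics describe only what was actually recorded.
    effects.record_metric(outcome);
    // 3. Retries are scheduled from a recorded failure.
    if outcome == Outcome::Retryable {
        effects.schedule_retry(delivery_id);
    }
    Ok(())
}

/// In-memory fake store, useful for exercising the ordering.
#[derive(Default)]
pub struct MemoryStore {
    pub fail_writes: bool,
    pub rows: Vec<(u64, Outcome)>,
}

impl AttemptStore for MemoryStore {
    fn persist(&mut self, delivery_id: u64, outcome: Outcome) -> Result<(), String> {
        if self.fail_writes {
            return Err("storage unavailable".to_string());
        }
        self.rows.push((delivery_id, outcome));
        Ok(())
    }
}

/// In-memory fake that records which side effects ran.
#[derive(Default)]
pub struct Recorder {
    pub metrics: Vec<Outcome>,
    pub retries: Vec<u64>,
}

impl SideEffects for Recorder {
    fn record_metric(&mut self, outcome: Outcome) {
        self.metrics.push(outcome);
    }
    fn schedule_retry(&mut self, delivery_id: u64) {
        self.retries.push(delivery_id);
    }
}
```

With fail_writes set on the fake store, finish_attempt returns an error and the recorder stays untouched, which is the behavior that keeps dashboards from reporting outcomes the database never stored.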

Rust encourages making those stages explicit. A type like AttemptInProgress can be consumed into CompletedAttempt. A retry scheduler can accept only completed attempts with retryable outcomes. That sounds strict, but strictness is useful in delivery infrastructure. Production incidents often come from states the original design treated as unlikely.
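
A minimal sketch of that idea, using move semantics so each stage consumes the previous one. AttemptInProgress and CompletedAttempt follow the names in the text. RetryableAttempt, schedule_retry, and the method names are assumptions added for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Success,
    Retryable,
    Permanent,
}

/// An attempt that has been reserved but has no result yet.
pub struct AttemptInProgress {
    delivery_id: u64,
    attempt: u32,
}

/// An attempt whose outcome has been classified.
pub struct CompletedAttempt {
    delivery_id: u64,
    attempt: u32,
    outcome: Outcome,
}

/// Proof that a completed attempt had a retryable outcome.
pub struct RetryableAttempt {
    delivery_id: u64,
    next_attempt: u32,
}

impl AttemptInProgress {
    pub fn start(delivery_id: u64, attempt: u32) -> Self {
        Self { delivery_id, attempt }
    }

    /// Takes `self` by value: once completed, the in-progress value is
    /// gone and cannot be completed a second time.
    pub fn complete(self, outcome: Outcome) -> CompletedAttempt {
        CompletedAttempt {
            delivery_id: self.delivery_id,
            attempt: self.attempt,
            outcome,
        }
    }
}

impl CompletedAttempt {
    pub fn outcome(&self) -> Outcome {
        self.outcome
    }

    /// Only retryable outcomes convert; success and permanent failure
    /// yield None and therefore can never reach the scheduler.
    pub fn into_retryable(self) -> Option<RetryableAttempt> {
        match self.outcome {
            Outcome::Retryable => Some(RetryableAttempt {
                delivery_id: self.delivery_id,
                next_attempt: self.attempt + 1,
            }),
            Outcome::Success | Outcome::Permanent => None,
        }
    }
}

impl RetryableAttempt {
    pub fn delivery_id(&self) -> u64 {
        self.delivery_id
    }
    pub fn next_attempt(&self) -> u32 {
        self.next_attempt
    }
}

/// The scheduler's signature accepts only RetryableAttempt, so passing
/// an in-progress or successful attempt is a compile error.
pub fn schedule_retry(queue: &mut Vec<RetryableAttempt>, attempt: RetryableAttempt) {
    queue.push(attempt);
}
```

In a real crate these types would live in their own module so the private fields prevent callers from constructing a CompletedAttempt or RetryableAttempt by hand.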

Trade-offs

The first trade-off is engineering speed. Go is fast to write, easy to read, and operationally familiar to many backend teams. Rust has a steeper learning curve. The borrow checker is not just a syntax hurdle. It requires engineers to model ownership, lifetimes, and mutation more carefully than they may be used to. That cost is real during the first weeks of a migration.

The second trade-off is ecosystem maturity. Go has a broad backend ecosystem, a strong standard library, and simple build behavior. Rust has excellent components, including reqwest, hyper, serde, Tokio, and Axum, but teams may still find gaps around operational glue. Sometimes the right answer is to write a small internal library, such as an exponential backoff scheduler with jitter. That can be fine, but it creates ownership cost.
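
For a sense of scale, the core of such a backoff helper is small. The sketch below uses the "full jitter" variant (a uniformly random delay between zero and a capped exponential ceiling), which is one common choice and not necessarily what the original team built. To stay dependency-free, the caller supplies the random fraction. backoff_delay and its parameters are hypothetical.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based), using full jitter.
///
/// `random_unit` is a caller-supplied random value in [0.0, 1.0]; in a
/// real service it would come from a random number generator. Taking it
/// as a parameter keeps this sketch dependency-free and testable.
pub fn backoff_delay(attempt: u32, base: Duration, cap: Duration, random_unit: f64) -> Duration {
    // Exponential ceiling: base * 2^attempt, saturating on overflow.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    let ceiling = base.checked_mul(factor).unwrap_or(cap).min(cap);

    // Guard against out-of-range or NaN input before scaling.
    let unit = if random_unit.is_nan() {
        0.0
    } else {
        random_unit.clamp(0.0, 1.0)
    };

    // Full jitter: anywhere between zero and the ceiling.
    ceiling.mul_f64(unit)
}
```

Jitter matters here because many deliveries to the same slow endpoint tend to fail together, and without it their retries would arrive together as well.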

The third trade-off is build complexity. Rust compile times can become painful in CI, especially for large async services with many dependencies. Tools like sccache and cargo check help, but they do not erase the difference. If a team optimizes only for local iteration speed, Go has a strong advantage.

The fourth trade-off is operational familiarity. Debugging async Rust requires understanding task scheduling, cancellation, lock scope, and instrumentation. A service can still deadlock logically even when the compiler prevents data races. Holding a lock across an await point, saturating the runtime, or misconfiguring connection pools can still produce production incidents. Rust removes some classes of memory and data-race bugs. It does not remove the need for load testing, tracing, chaos testing, or careful queue design.

The fifth trade-off is migration risk. Rewriting a service that signs payloads and talks to customer infrastructure can break compatibility in subtle ways. JSON field ordering, timestamp formats, HMAC canonicalization, header casing expectations, timeout defaults, and redirect behavior can all matter. A correct migration plan treats the old implementation as a behavioral specification until the new one proves equivalence where customers depend on it.

Broader Pattern

The lesson is not that every Go webhook service should become Rust. The better lesson is that delivery platforms need mechanical sympathy with their failure modes. If the workload is small, Go may be the right tool because the team can build and operate it quickly. If the workload is high-concurrency, latency-sensitive, and full of tricky state transitions, Rust can pay for itself by reducing runtime uncertainty and forcing sharper models.

For webhook systems, the biggest design questions are independent of language:

What is the delivery guarantee, and is it documented honestly?
Are retries bounded, jittered, and separated by failure class?
Can receivers deduplicate using stable identifiers?
Is success recorded only after the attempt is fully classified?
Does the system apply backpressure when receivers slow down?
Can operators replay safely without creating uncontrolled duplicates?
Are payload signing rules stable across versions?

A language rewrite can create the opportunity to answer those questions again. That is probably why this migration produced such large gains. The team did not merely translate syntax. It changed the core delivery engine, tightened the API layer, and validated behavior during a staged rollout.

Rust was a good fit because webhook delivery is a place where predictable memory, explicit error handling, and typed state transitions directly map to production outcomes. Go remains a good fit for many services, especially when simplicity and team familiarity dominate. The deciding factor is not language identity. It is whether the system's failure modes are costly enough that stronger compile-time constraints and lower runtime overhead justify the extra engineering cost.

For teams considering a similar move, the practical path is to start with the hot path, not the whole platform. Model delivery states explicitly. Compare old and new behavior under shadow traffic. Load test receiver failure, not just happy-path throughput. Treat timeouts as ambiguous. Assume duplicate delivery will happen. Design the API so customers can survive it.

That is the part experience teaches the hard way: webhook systems do not fail only when code crashes. They fail when the platform records the wrong truth, retries without discipline, or hides uncertainty behind a clean status field. Rust can help build a tighter implementation, but the real win comes from using the rewrite to make the distributed system more honest.
