Treat API Timeouts as “Unknown” – A Pragmatic State‑Machine Approach

Backend Reporter

A timeout does not equal failure. By adding an explicit “unknown” state to payment‑gateway and other external‑call workflows, systems avoid blind retries that cause double‑charges, inconsistent data, and lost trust. The article explains the five‑state machine, its impact on consistency models, and trade‑offs for scalability and API design.

When a downstream service stalls, the instinctive response is to mark the request as a failure and retry. In payment processing that habit can turn a single purchase into a double‑charge, and the same pattern repeats across any unreliable external API. The root problem is a missing state: unknown.


The problem: timeouts masquerade as failures

  1. Hidden success – The remote system may have completed the operation but failed to send a response before the client timed out.
  2. Blind retry – Client code interprets the timeout as a failure and resends the request, potentially causing a second execution of the same side‑effect.
  3. Inconsistent data – Downstream systems end up with divergent views of the transaction, breaking eventual consistency guarantees.
  4. Operational noise – Alerting systems fire on every timeout, yet the real issue (an ambiguous outcome) is buried under generic "failure" metrics.

In a payment gateway this translates directly to a customer seeing two charges for one order. In other domains – blockchain RPCs, AI model serving, message‑queue acknowledgements – the same ambiguity can corrupt state or waste resources.


A solution: make ambiguity an explicit state

Instead of a binary pending → success/failure flow, introduce a five‑state finite state machine (FSM):

State      | Meaning
-----------|------------------------------------------------------------------
pending    | Request sent, awaiting any response.
succeeded  | Remote system confirmed success (e.g., payment receipt).
failed     | Remote system confirmed failure (e.g., declined card).
retryable  | Temporary error that is safe to retry (e.g., 502, rate‑limit).
unknown    | Timeout or ambiguous response – outcome cannot be determined.

When a timeout occurs, the FSM transitions to unknown instead of failed. The system then:

  • Stops automatic retries.
  • Flags the transaction for manual investigation or a compensating workflow.
  • Emits a distinct metric (api_timeout_unknown) that can be monitored separately from genuine failures.

The open‑source Rust gateway Azums implements this pattern. Its source is available on GitHub: https://github.com/BlockForge-Dev/Azums.
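
As a rough illustration of the five states and the timeout transition, here is a minimal Python sketch. It is not taken from Azums; the names TxState, TRANSITIONS, transition, and classify are hypothetical, and the status-code mapping is an assumption that simply follows the examples in the table above.

```python
from enum import Enum
from typing import Optional


class TxState(Enum):
    PENDING = "pending"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    RETRYABLE = "retryable"
    UNKNOWN = "unknown"


# Allowed transitions. Terminal states (succeeded, failed) have no exits.
# Assumption: unknown only resolves to a confirmed outcome and is never retried.
TRANSITIONS = {
    TxState.PENDING: {TxState.SUCCEEDED, TxState.FAILED,
                      TxState.RETRYABLE, TxState.UNKNOWN},
    TxState.RETRYABLE: {TxState.PENDING},
    TxState.UNKNOWN: {TxState.SUCCEEDED, TxState.FAILED},
    TxState.SUCCEEDED: set(),
    TxState.FAILED: set(),
}

# Illustrative only: 502 and rate limiting (429) follow the table's examples.
# Whether a given code is really safe to retry depends on the provider.
RETRYABLE_CODES = {429, 502}


def transition(current: TxState, target: TxState) -> TxState:
    """Return the target state, or raise if the move is not allowed."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


def classify(status_code: Optional[int], timed_out: bool) -> TxState:
    """Map a raw call outcome to a state; a timeout is unknown, not failed."""
    if timed_out or status_code is None:
        return TxState.UNKNOWN
    if 200 <= status_code < 300:
        return TxState.SUCCEEDED
    if status_code in RETRYABLE_CODES:
        return TxState.RETRYABLE
    if 400 <= status_code < 500:
        return TxState.FAILED
    return TxState.UNKNOWN  # anything else is treated as ambiguous
```

Keeping the transition table as data makes the rule "unknown never goes back to pending automatically" something the code enforces rather than a convention reviewers have to remember.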


Consistency implications

Strong vs. eventual consistency

  • Strong consistency requires that all replicas agree on the final state before the client proceeds. A workflow that needs this guarantee cannot act on an unknown transaction: the caller has to wait until the state resolves to succeeded or failed, trading latency for correctness.
  • Eventual consistency tolerates temporary divergence. By persisting the unknown state, downstream services can continue processing other requests while a reconciliation job later resolves the ambiguity (e.g., by querying the provider’s audit log).
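
A reconciliation pass of the kind described above might look like the following sketch. The store is a plain dictionary and lookup is a hypothetical callable standing in for a query against the provider's audit log; a real job would read from a durable table and call the provider's API.

```python
from typing import Callable, Dict, Optional


def reconcile(
    store: Dict[str, str],
    lookup: Callable[[str], Optional[str]],
) -> int:
    """Resolve unknown transactions using the provider's record of what happened.

    `lookup` (hypothetical) returns "succeeded", "failed", or None when the
    provider has no definitive answer yet. Returns how many were resolved.
    """
    resolved = 0
    for tx_id, state in list(store.items()):
        if state != "unknown":
            continue
        outcome = lookup(tx_id)
        if outcome in ("succeeded", "failed"):
            store[tx_id] = outcome
            resolved += 1
        # Otherwise leave it as unknown; a later pass will try again.
    return resolved


# Usage with an in-memory stand-in for the provider's audit log.
audit_log = {"tx-1": "succeeded", "tx-2": "failed"}
store = {"tx-1": "unknown", "tx-2": "unknown",
         "tx-3": "unknown", "tx-4": "succeeded"}
count = reconcile(store, audit_log.get)
```

Transactions the provider cannot yet confirm stay in unknown, so the job can run repeatedly without ever guessing an outcome.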

Idempotency guarantees

The unknown state encourages developers to design idempotent endpoints. If a request can be safely repeated without duplicating its side effects (for example, because the provider deduplicates on an idempotency key), the retryable state can be used; otherwise, the system must fall back to reconciliation or manual review. This separation reduces the need for ad‑hoc “duplicate‑check” code scattered throughout the codebase.
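
One common way to make a request safe to repeat is an idempotency key that the client generates once and resends on every retry. The sketch below uses a hypothetical in-memory FakeProvider to show the idea; real providers differ in how they accept and expire such keys.

```python
import uuid
from typing import Dict


class FakeProvider:
    """In-memory stand-in for a provider that honours idempotency keys."""

    def __init__(self) -> None:
        self.charges: Dict[str, dict] = {}

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # A repeated key returns the stored result instead of charging again.
        if idempotency_key not in self.charges:
            self.charges[idempotency_key] = {
                "charge_id": str(uuid.uuid4()),
                "amount_cents": amount_cents,
            }
        return self.charges[idempotency_key]


provider = FakeProvider()
key = str(uuid.uuid4())  # generated once per logical payment, reused on retry
first = provider.charge(key, 1999)
second = provider.charge(key, 1999)  # e.g. a retry after a lost response
```

Both calls return the same charge, so a retry after a lost response does not produce a second side effect.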


API design patterns that support the FSM

  1. Explicit status endpoints – Provide a /transactions/{id}/status route that returns one of the five states. Clients can poll this endpoint instead of guessing the outcome after a timeout.
  2. Webhook confirmations – Let the provider push a definitive succeeded or failed event. If the webhook never arrives, the transaction stays in unknown.
  3. Correlation IDs – Include a unique X-Request-ID header on every outbound call. When reconciling an unknown transaction, the system can search logs on the provider side using this ID.
  4. Compensating actions – Define a cancel or reverse endpoint that can be invoked only when the transaction is in unknown and the business logic decides a reversal is safe.
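
On the client side, polling a status endpoint could be sketched as below. fetch_status is a hypothetical callable that wraps the HTTP request to the status route; the attempt count and backoff values are arbitrary assumptions.

```python
import time
from typing import Callable

TERMINAL = {"succeeded", "failed"}


def poll_status(
    fetch_status: Callable[[], str],
    attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until a terminal state is reported or attempts run out.

    Returns the last state seen, which stays "unknown" if nothing
    definitive arrives.
    """
    state = "unknown"
    for attempt in range(attempts):
        state = fetch_status()
        if state in TERMINAL:
            return state
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))  # exponential backoff
    return state


# Usage with canned responses and no real sleeping.
responses = iter(["unknown", "unknown", "succeeded"])
result = poll_status(lambda: next(responses), sleep=lambda _: None)
```

If the attempts run out, the function returns the non-terminal state rather than inventing an outcome, which leaves the transaction for a webhook or a reconciliation job to resolve.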

Trade‑offs and scalability considerations

Aspect | Benefit of “unknown” state | Cost / Trade‑off
-------|----------------------------|-----------------
Reliability | Prevents irreversible side‑effects (e.g., double‑charges). | Requires additional monitoring and manual or automated reconciliation pipelines.
Latency | Allows the system to pause instead of hammering a flaky provider. | Client‑perceived response time may increase if the caller waits for a definitive answer.
Throughput | Reduces wasted retries, freeing capacity for healthy calls. | Storing and processing unknown records adds I/O overhead; must be sharded or partitioned for large volumes.
Complexity | Clear state model simplifies reasoning about edge cases. | Developers must handle a new state in UI, reporting, and downstream services.

In high‑throughput environments (e.g., a marketplace processing thousands of payments per second), the unknown bucket can be partitioned by time window and processed by a background worker pool. This keeps the hot path fast while still guaranteeing eventual resolution.
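
Partitioning by time window can be as simple as grouping records by the start of the window they were created in. The following sketch assumes each unknown record is a (transaction ID, creation time in epoch seconds) pair and uses an arbitrary five-minute window; bucket_by_window is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def bucket_by_window(
    records: Iterable[Tuple[str, float]],
    window_seconds: int = 300,
) -> Dict[int, List[str]]:
    """Group (tx_id, created_at) pairs by the start of their time window."""
    buckets: Dict[int, List[str]] = defaultdict(list)
    for tx_id, created_at in records:
        window_start = int(created_at // window_seconds) * window_seconds
        buckets[window_start].append(tx_id)
    return dict(buckets)


unknown_records = [("tx-1", 1000.0), ("tx-2", 1100.0), ("tx-3", 1250.0)]
buckets = bucket_by_window(unknown_records, window_seconds=300)
# tx-1 and tx-2 land in the window starting at 900, tx-3 in the one at 1200
```

Each bucket can then be handed to a separate background worker, so older windows are reconciled without blocking new writes.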


Beyond payments: other domains that benefit

  • Blockchain RPCs – A node may accept a transaction, broadcast it, and then drop the connection. Mark the call unknown and later verify inclusion via a block explorer.
  • AI model serving – Large language model endpoints sometimes exceed the client timeout while still generating a response. An unknown state prompts a follow‑up fetch using the request ID.
  • Message queues – When an acknowledgement is lost, the producer can treat the send as unknown and rely on a deduplication key on the consumer side.
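
For the message-queue case, consumer-side deduplication might be sketched as follows. DedupConsumer is a hypothetical class, and the in-memory set is an assumption made for brevity; a real consumer would keep the seen keys in a durable store.

```python
from typing import Callable, Set


class DedupConsumer:
    """Applies each message at most once, keyed by a producer-supplied key."""

    def __init__(self, handler: Callable[[str], None]) -> None:
        self._seen: Set[str] = set()
        self._handler = handler

    def consume(self, dedup_key: str, payload: str) -> bool:
        if dedup_key in self._seen:
            return False  # duplicate delivery, already applied
        self._handler(payload)
        self._seen.add(dedup_key)  # record only after the handler succeeds
        return True


applied = []
consumer = DedupConsumer(applied.append)
consumer.consume("order-42", "ship order 42")
consumer.consume("order-42", "ship order 42")  # resent after a lost ack
```

With the key in place, the producer can resend an unknown message without risking a second execution on the consumer side.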

Getting started with the pattern

  1. Define the FSM in your service schema (e.g., a Rust enum, a TypeScript union, or a PostgreSQL CHECK constraint).
  2. Instrument timeouts – Wrap every external call with a timeout handler that writes an unknown record on expiry.
  3. Expose status APIs – Let callers query the current state instead of guessing.
  4. Build reconciliation workers – Periodically reconcile unknown entries against provider logs or audit tables.
  5. Monitor distinct metrics – Separate api_timeout_unknown from api_failure to avoid alert fatigue.
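
Steps 2 and 5 can be combined in a small wrapper around each external call. The sketch below uses asyncio.wait_for from the Python standard library; the states dictionary and metrics counter are in-memory stand-ins for a durable table and a metrics client, and call_with_timeout is a hypothetical helper.

```python
import asyncio
from collections import Counter
from typing import Awaitable, Callable, Dict

metrics: Counter = Counter()  # stand-in for a real metrics client
states: Dict[str, str] = {}   # stand-in for a durable transaction table


async def call_with_timeout(
    tx_id: str,
    call: Callable[[], Awaitable[bool]],
    timeout_seconds: float,
) -> str:
    """Run an external call and record its state; expiry is unknown, not failed."""
    states[tx_id] = "pending"
    try:
        ok = await asyncio.wait_for(call(), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        # The local task is cancelled, but the remote side may still have acted.
        states[tx_id] = "unknown"
        metrics["api_timeout_unknown"] += 1
        return states[tx_id]
    states[tx_id] = "succeeded" if ok else "failed"
    if not ok:
        metrics["api_failure"] += 1
    return states[tx_id]


async def slow_provider() -> bool:
    await asyncio.sleep(1.0)  # simulates a provider that answers too late
    return True


async def declined() -> bool:
    return False  # simulates a confirmed failure


result = asyncio.run(call_with_timeout("tx-1", slow_provider, timeout_seconds=0.05))
```

Because the timeout increments its own counter, dashboards can alert on confirmed failures and ambiguous outcomes separately.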

Conclusion

Treating a timeout as an unknown state forces the system to acknowledge its lack of knowledge instead of making a dangerous assumption. The approach adds a small amount of operational complexity but pays off in data integrity, customer trust, and fewer noisy retries. Whether you are building a payment gateway, a blockchain bridge, or an AI‑powered microservice, make ambiguity a first‑class citizen in your API contracts.


Share your worst timeout‑induced disaster in the comments – the lessons learned are often the best source of improvement.
