Microsoft pg_durable Brings Durable Workflows Inside PostgreSQL
#Infrastructure

Microsoft pg_durable Brings Durable Workflows Inside PostgreSQL

Serverless Reporter
7 min read

Microsoft’s pg_durable reframes workflow orchestration as database-local execution, trading external control planes for checkpointed SQL graphs that live beside the data they coordinate.

Featured image

Service Update

Microsoft has open-sourced pg_durable, a PostgreSQL extension for running durable workflows directly inside the database. The project targets a familiar architecture problem: teams often combine cron, app-tier workers, queues, retry tables, and workflow services just to run long-lived data tasks safely. pg_durable moves that coordination into PostgreSQL, where workflow state, checkpoints, retries, and progress tracking are persisted alongside the operational data.

The model is simple to describe. A workflow is expressed as a graph of SQL steps. PostgreSQL executes each step, records progress, and resumes from the last durable checkpoint after a crash, restart, failover, or step failure. Microsoft’s framing is that some scheduler glue, queue consumers, and background workers can disappear when the workflow is fundamentally data-local.

The extension exposes a SQL DSL with operators for sequential execution, binding intermediate results, fan-out, joins, scheduling, and conditions. A batch processing flow can select unprocessed document IDs, bind that result to a variable, then update those rows in the next step. Parallel branches can run independent SQL nodes and wait for completion through df.join or the & operator.

Architecturally, pg_durable is intentionally small. It consists of a PostgreSQL extension plus a background worker, with no separate control plane. Under the covers, it uses the Rust libraries duroxide and duroxide-pg for deterministic replay, checkpoints, timers, work queues, persisted history, and instance state. That makes the database the workflow store and execution host, not merely the system of record behind an external orchestrator.

This puts pg_durable in conversation with services such as AWS Step Functions, Azure Durable Functions, Google Cloud Workflows, Temporal, and Cloudflare Workflows. The difference is placement. Those systems coordinate work outside the database and call into storage, APIs, functions, and services. pg_durable starts from the opposite side: if the workflow is mostly SQL, keep the orchestration near PostgreSQL.

There is no separate managed cloud service SKU or announced price cut attached to pg_durable. The pricing change is architectural rather than vendor-published. Instead of paying for a workflow service’s state transitions, scheduler invocations, queue requests, function runtime, and worker fleet capacity, teams shift more of that work into PostgreSQL CPU, storage, write-ahead logging, backups, replication, and I/O. That can reduce service sprawl for database-heavy workloads, but it can also move orchestration cost onto the most critical tier in the system.

For teams already running PostgreSQL as the center of an application, that shift is attractive. For teams using managed Postgres through Azure Database for PostgreSQL, Amazon RDS, AlloyDB, or Cloud SQL, adoption will depend on whether the extension is supported in the target environment, whether background workers are allowed, and whether operational policies permit custom extensions. In self-managed Postgres, the path is more direct, but the operational burden is also fully yours.

Use Cases

The clearest fit is data-local orchestration. Consider a vector embedding pipeline using pgvector. An application inserts documents into PostgreSQL. A workflow finds unprocessed rows, chunks content, calls an embedding API, stores vectors, updates status, and retries failed items. In a conventional design, that might require a scheduler, a queue, an embedding worker, a retry table, and cleanup jobs. With pg_durable, the control flow can be represented as checkpointed SQL steps, while the actual external API call remains the boundary that needs careful timeout and idempotency handling.

Microsoft Open-Sources PostgreSQL Extension for In-Database Durable Execution - InfoQ

This pattern also fits database maintenance workflows. A durable function could inspect table bloat, calculate candidates for vacuum or reindex operations, send a notification, wait for approval, and run follow-up SQL during a maintenance window. The value is not that PostgreSQL suddenly replaces every operations platform. The value is that the workflow state does not have to be reconstructed from logs, queue visibility timeouts, or half-updated status columns after an interruption.

The extension is also relevant to event-driven systems that already use PostgreSQL as a coordination point. Many teams implement a transactional outbox table so business changes and outbound events commit together. A pg_durable workflow could process the outbox, group events, call downstream APIs, and mark delivery checkpoints in the same database. That can simplify smaller systems where Kafka, a workflow engine, and multiple workers are too much machinery for the problem.

A FaaS architecture can still be part of the design. For example, an API request might write a job row to PostgreSQL, pg_durable might coordinate the state machine, and individual steps might call Azure Functions, AWS Lambda, or internal HTTP services for compute-heavy work. In that layout, the database owns durable control flow while functions remain stateless executors. This is a useful split when orchestration must be close to data consistency, but compute should scale independently.

The same idea applies to AI agent and automation workloads. Long-running agent flows often need memory, checkpoints, retries, timers, external calls, and human approval steps. A database-native workflow can be a practical control plane for constrained enterprise automations, especially when the source of truth already lives in relational tables. The SQL graph becomes the audit trail of what happened, what completed, and what must resume.

Integration patterns will matter more than syntax. A team could use pg_durable for intra-database workflows, pair it with LISTEN/NOTIFY for lightweight event signaling, combine it with a transactional outbox for external publication, or use it as the retry layer behind webhook delivery. The best designs will keep each step idempotent, store external request identifiers, and make compensation explicit where a remote API cannot be rolled back with the database transaction.

This is where pg_durable’s appeal becomes architectural. It does not ask every workload to become a stored procedure. It gives database-centered workflows a durable execution substrate so teams can avoid rebuilding the same retry and resume logic in each service. For applications where PostgreSQL is already the source of truth, that can make the system easier to reason about.

Trade-Offs

The main benefit is locality. Workflow progress is stored in PostgreSQL, the data being processed is in PostgreSQL, and many steps are SQL. That reduces impedance between state, queries, retries, and auditability. A failed workflow can resume from a checkpoint rather than asking operators to infer the last safe step from application logs and partial writes.

The cost is coupling. Putting orchestration into the database increases the importance of database capacity planning, extension lifecycle management, monitoring, and backup strategy. A workflow engine outside the database can fail or scale independently. A database-local workflow engine competes with application queries for CPU, locks, connection slots, storage bandwidth, and replication throughput. If the workflow is noisy, the blast radius includes the primary data store.

There is also a portability question. SQL is portable in theory, but PostgreSQL extensions, background workers, and custom operators are specific in practice. A system built deeply around pg_durable will not move unchanged to MySQL, DynamoDB, Spanner, or a managed Postgres plan that forbids the extension. That may be acceptable for teams already committed to PostgreSQL, but it should be treated as a platform decision rather than a library choice.

External API calls require particular care. Durable execution can retry a failed step, but it cannot make an arbitrary remote side effect transactional. If a payment provider, embedding API, ticketing system, or notification service receives a request and the database crashes before recording success, the workflow may call again. The answer is standard distributed-systems discipline: idempotency keys, deduplication tables, explicit status records, and clear compensation logic.

Observability also changes. In an external orchestrator, execution history, retries, timers, and failed activities are usually first-class UI concepts. With pg_durable, those signals live closer to database tables, extension views, logs, and metrics. Teams will need dashboards that expose queue depth, failed workflow instances, retry counts, long-running steps, checkpoint age, lock waits, and WAL growth. Without that, the design can become easier to write but harder to operate.

The comparison with managed workflow services is not one-sided. AWS Step Functions pricing, Azure Functions pricing, Google Cloud Workflows pricing, and queue pricing are visible, metered, and separable from database load. pg_durable may remove some of those line items, but it can increase the hidden cost of database headroom. For a high-volume workflow, the right question is not only which service is cheaper per operation. The better question is which tier should absorb orchestration pressure.

A pragmatic adoption path starts with bounded workflows. Document processing, recurring maintenance, outbox draining, embedding backfills, and approval flows are good candidates because their state is already relational and their failure modes are understandable. Latency-sensitive, high-fan-out, cross-service workflows may still belong in Temporal, Step Functions, Durable Functions, or another external coordinator with dedicated scaling and visibility.

pg_durable is a sign that durable execution is becoming a design primitive, not only a hosted service category. The interesting part is not that PostgreSQL can run more logic. PostgreSQL has always invited that debate. The interesting part is that the extension treats workflow progress itself as durable data, then gives SQL authors a way to compose that progress without surrounding the database with a nest of small control services.

Comments

Loading comments...