Steered, Not Replayed: Execution Graphs vs Workflow Graphs

Backend Reporter

This article distinguishes three durability models (replay-based workflow engines, CRIU-style snapshots, and live execution-state graphs) and explains why multi-operator, human-AI collaboration calls for a steered, owner-less execution graph rather than deterministic replay or a single-process freeze.

Originally published at docs.cmdop.com/blog/execution-state-continuity-03-steered-not-replayed


The problem: surviving failure for a running computation

Modern back‑ends must keep doing work even when networks flap, nodes crash, or downstream services time out. Two broad families of solutions claim to provide durable execution, but they solve fundamentally different problems and are often conflated under the same buzzword.

  1. Replay‑based durable execution – record every decision and side effect in an append‑only journal, then restart the program from its entry point on a fresh worker, feeding the recorded results back in.
  2. Live execution-state steering – hold the OS-level execution graph (process tree, PTY, file descriptors, sockets) in place while it runs, so that multiple operators can attach to and mutate the running system.

Both approaches surface an "execution graph" in their UI, yet the semantics of that graph differ dramatically. This article argues that the industry has been missing a distinct third category, alongside replay engines and process snapshots: one that combines the durable identity of replay with the live-state visibility that human-AI collaboration requires.


1️⃣ Replay‑based durable execution (Temporal, Cadence, Dapr, Azure Durable Functions)

How it works

  • The runtime splits workflow code (deterministic orchestration) from activities (non‑deterministic side effects).
  • Each activity result is appended to an event history.
  • When a worker crashes, a new worker re‑executes the workflow from the top. The SDK intercepts activity calls, finds the stored result, and returns it instantly – no side effect is repeated.
  • A stable workflow ID lets signals and queries reach the instance regardless of which worker hosts it.
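
The replay mechanism above can be sketched in a few lines. This is a toy model, not the API of Temporal or any other engine: `ReplayContext`, `charge_card`, `send_email`, and `order_workflow` are hypothetical names, and the event history is an in-memory list rather than a durable log.

```python
from typing import Any, Callable, Optional


class ReplayContext:
    """Hypothetical context: journals activity results and replays them."""

    def __init__(self, history: Optional[list] = None) -> None:
        self.history: list = list(history or [])  # append-only event history
        self._cursor = 0  # index of the next activity call in this run

    def activity(self, fn: Callable[..., Any], *args: Any) -> Any:
        if self._cursor < len(self.history):
            result = self.history[self._cursor]  # replay: reuse the stored result
        else:
            result = fn(*args)  # first execution: perform the real side effect
            self.history.append(result)
        self._cursor += 1
        return result


calls: list = []  # records which side effects really ran


def charge_card(order_id: str) -> str:
    calls.append(f"charge:{order_id}")
    return f"receipt-{order_id}"


def send_email(receipt: str) -> str:
    calls.append(f"email:{receipt}")
    return "sent"


def order_workflow(ctx: ReplayContext, order_id: str) -> str:
    receipt = ctx.activity(charge_card, order_id)
    return ctx.activity(send_email, receipt)


# The first worker completes the charge, then "crashes" before the email.
first = ReplayContext()
first.activity(charge_card, "42")

# A fresh worker re-runs the workflow from the top with the saved history.
second = ReplayContext(first.history)
outcome = order_workflow(second, "42")
```

After the second run, `calls` contains one charge and one email: the charge was served from the history instead of being executed again.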

What it gives you

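  • Effectively-once results for external calls: once an activity completes, its result is recorded and the call is never re-executed on replay. An activity that fails or times out before completing is retried, so activities should still be idempotent.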
  • "Sleep for 30 days" becomes a cheap timer; the worker can be reclaimed while idle.
  • Scale‑to‑zero, low cost, and the ability to run on any language runtime because no heap or register state is captured.
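
To see why a 30-day sleep can be cheap, consider a minimal sketch in which sleeping only records a wake-up time. `TimerStore` is a hypothetical name, and the in-memory heap stands in for what a real engine would keep in durable storage.

```python
import heapq


class TimerStore:
    """Hypothetical durable timer table (in memory here for illustration)."""

    def __init__(self) -> None:
        self._timers: list = []  # min-heap of (fire_at_seconds, workflow_id)

    def sleep(self, workflow_id: str, now: int, seconds: int) -> None:
        # Record the wake-up time and return at once: no worker stays blocked.
        heapq.heappush(self._timers, (now + seconds, workflow_id))

    def due(self, now: int) -> list:
        """Pop and return the workflow IDs whose timers have fired by `now`."""
        ready = []
        while self._timers and self._timers[0][0] <= now:
            ready.append(heapq.heappop(self._timers)[1])
        return ready


THIRTY_DAYS = 30 * 24 * 60 * 60

store = TimerStore()
store.sleep("invoice-7", now=0, seconds=THIRTY_DAYS)
early = store.due(now=THIRTY_DAYS - 1)  # nothing to wake yet
woken = store.due(now=THIRTY_DAYS)      # the workflow can go to any free worker
```

Between the two calls nothing runs on behalf of the workflow, which is what makes scale-to-zero possible.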

The price you pay

  • Determinism contract – reading the wall clock, generating randomness, or accessing a mutable DB outside an activity breaks replay and triggers a non‑determinism error.
  • Logical state only – the engine reconstructs a workflow graph (which steps completed, which are pending) but never knows about live OS resources like open sockets, PTYs, or partially‑executed shells.
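
The determinism contract can be illustrated with a toy replayer that compares each requested step against the recorded history. All names here (`CheckedContext`, `NonDeterminismError`, `workflow`) are hypothetical, and the `is_weekend` argument stands in for a wall-clock read made directly in workflow code.

```python
class NonDeterminismError(Exception):
    """Raised when replayed code asks for a step the history does not contain."""


class CheckedContext:
    def __init__(self, history=None):
        self.history = list(history or [])  # list of (activity_name, result)
        self._cursor = 0

    def activity(self, name, fn):
        if self._cursor < len(self.history):
            recorded_name, result = self.history[self._cursor]
            if recorded_name != name:
                raise NonDeterminismError(
                    f"history has {recorded_name!r} at step {self._cursor}, "
                    f"but the code asked for {name!r}"
                )
        else:
            result = fn()
            self.history.append((name, result))
        self._cursor += 1
        return result


def workflow(ctx, is_weekend):
    # Deliberate bug: the branch depends on a value that changes between runs.
    if is_weekend:
        return ctx.activity("queue_for_monday", lambda: "queued")
    return ctx.activity("ship_now", lambda: "shipped")


original = CheckedContext()
workflow(original, is_weekend=False)  # recorded on a weekday

replay = CheckedContext(original.history)
try:
    workflow(replay, is_weekend=True)  # replayed on a weekend
    diverged = False
except NonDeterminismError:
    diverged = True
```

Moving the clock read into an activity would record its value in the history, so every replay would take the same branch.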

Replay is excellent for business processes, sagas, and long‑running orchestrations, but it cannot resume a half‑filled interactive session.


2️⃣ CRIU‑style single‑process snapshots

CRIU (Checkpoint/Restore in Userspace) captures a process bit‑for‑bit:

  • Memory pages, CPU registers, file‑descriptor tables, even TCP state via TCP_REPAIR.
  • The snapshot can be restored on another host, provided the kernel is compatible and the CPU architecture (ISA) matches.
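
CRIU is usually driven from its command line. The sketch below only builds the `criu dump` and `criu restore` invocations; it does not run them, since that requires root privileges and an installed `criu` binary. The helper names and the image directory are hypothetical. `--shell-job` is needed only when the process is attached to a terminal, and `--tcp-established` asks CRIU to handle established TCP connections.

```python
def criu_dump_cmd(pid: int, images_dir: str, tcp: bool = False) -> list:
    """Build the argv that checkpoints the process tree rooted at `pid`."""
    cmd = ["criu", "dump", "--tree", str(pid),
           "--images-dir", images_dir, "--shell-job"]
    if tcp:
        cmd.append("--tcp-established")
    return cmd


def criu_restore_cmd(images_dir: str, tcp: bool = False) -> list:
    """Build the argv that restores a process tree from `images_dir`."""
    cmd = ["criu", "restore", "--images-dir", images_dir, "--shell-job"]
    if tcp:
        cmd.append("--tcp-established")
    return cmd


dump = criu_dump_cmd(1234, "/var/tmp/ckpt", tcp=True)
restore = criu_restore_cmd("/var/tmp/ckpt", tcp=True)
```

Either list could be passed to `subprocess.run(cmd, check=True)` on a suitably prepared host.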

What it gives you

  • Faithful recreation of a live process – useful for migration, debugging, or fault‑tolerant containers.

The limitations

  • No control plane – there is no stable identity that survives beyond the snapshot file.
  • Single‑owner – only one restorer can hold the process; no concurrent operators can attach.
  • No logical continuity – the system cannot answer "which step of a business workflow am I on?" because it only knows about raw OS state.

3️⃣ Live execution‑state graphs (the missing middle column)

Definition

An execution graph here is the live OS process tree together with its PTY, file‑descriptors, and socket state. It is held rather than reconstructed, and it is exposed as a first‑class, addressable object that carries a stable identity independent of any host.
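
A rough illustration of "held rather than reconstructed": the sketch below keeps one shell process alive under a stable ID and lets two operators act on the same live state. `HeldShell` is a hypothetical name, not part of any product mentioned here. To stay portable it talks to `/bin/sh` over plain pipes; a fuller version would allocate a PTY (for example with `pty.openpty()`) and expose the session over a network.

```python
import subprocess
import uuid


class HeldShell:
    """Hypothetical held session: one /bin/sh kept alive between attaches."""

    def __init__(self) -> None:
        self.session_id = str(uuid.uuid4())  # stable identity, not tied to the PID
        self.log: list = []                  # (operator, command) in arrival order
        self.proc = subprocess.Popen(
            ["/bin/sh"],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT, text=True,
        )

    def run(self, operator: str, command: str) -> str:
        """Run `command` for `operator` in the live shell and return its output."""
        self.log.append((operator, command))
        marker = f"__done_{uuid.uuid4().hex}__"
        self.proc.stdin.write(f"{command}\necho {marker}\n")
        self.proc.stdin.flush()
        output = []
        while True:
            line = self.proc.stdout.readline()
            if not line or marker in line:  # shell exited, or command finished
                break
            output.append(line)
        return "".join(output)

    def close(self) -> None:
        self.proc.stdin.close()
        self.proc.wait()


shell = HeldShell()
shell.run("agent", "BUILD_STATUS=failed; cd /")
seen_by_human = shell.run("human", 'echo "state=$BUILD_STATUS in $PWD"')
shell.close()
```

The second operator sees the variable and working directory left by the first because both commands ran in the same live process; nothing was replayed or restored.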

Key properties

| Property | Replay engines | Live execution-state graph | CRIU snapshots |
| --- | --- | --- | --- |
| State model | Logical workflow state | Live OS execution graph (process tree, PTY, sockets) | Raw OS state of one process tree |
| Determinism | Mandatory | Not required – live system is inherently nondeterministic | None – state captured as-is |
| Persistence | Append-only event log | Execution graph persisted as a live object; checkpoints optional | Point-in-time image |
| Multi-client attach | Signals/queries only, no shared surface | Concurrent heterogeneous clients (human, AI, monitoring) can attach to the same live graph | No – restore yields a single process for a single restorer |
| Steering vs replay | Signals drive workflow steps; cannot grab a live shell | Operators can steer the running environment, mutate PTY, open files, etc. | Freeze-thaw only |
| Recovery | Re-run from entry point, replay log | Re-home the live graph if a checkpoint exists; otherwise rebuild from durable session state (identity + minimal session data) | Restore snapshot on compatible host |
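
The recovery rule for a live graph (re-home from a checkpoint if one exists, otherwise rebuild from minimal session state) might look like the sketch below. `SessionRecord`, its fields, and `plan_recovery` are hypothetical; which data counts as "minimal session data" is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SessionRecord:
    """Hypothetical durable record kept for every live graph."""
    session_id: str                          # stable identity
    cwd: str = "/"                           # minimal session data (assumed)
    env: dict = field(default_factory=dict)  # minimal session data (assumed)
    checkpoint: Optional[str] = None         # latest checkpoint path, if any


def plan_recovery(record: SessionRecord, new_host: str) -> dict:
    """Decide how to bring a session back after its host is lost."""
    if record.checkpoint is not None:
        # Full live state is available: move the graph to the new host.
        return {"action": "re-home", "host": new_host,
                "session_id": record.session_id,
                "restore_from": record.checkpoint}
    # No checkpoint: keep the identity, start fresh processes (partial loss).
    return {"action": "rebuild", "host": new_host,
            "session_id": record.session_id,
            "cwd": record.cwd, "env": dict(record.env)}


with_ckpt = plan_recovery(
    SessionRecord("s-1", checkpoint="/ckpt/s-1/0007"), new_host="host-b")
without_ckpt = plan_recovery(
    SessionRecord("s-2", cwd="/srv/app", env={"STAGE": "dev"}), new_host="host-b")
```

In both branches the session ID is unchanged, so operators keep addressing the same object even though the processes behind it may be new.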

Why it matters now

Agentic systems (LLM‑driven bots, autonomous dev‑ops agents) need two capabilities that replay alone cannot provide:

  1. Pause for a human inside a live environment – an AI may be building a Docker image, hit a compile error, and hand the session to a developer who needs to inspect the exact shell state, modify a file, and then let the agent continue.
  2. Transferable authority – the same live session must be observable by multiple operators, each with a clearly attributed edit grant. The system must serialize who performed which mutation so that audit trails stay coherent.
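
One possible shape for transferable authority is a single edit grant that is handed from operator to operator, with every mutation serialized into an attributed log. This is an assumption about how such a system could work, not a description of any existing implementation; `SteeringGrant` and its methods are hypothetical.

```python
import threading


class SteeringGrant:
    """Hypothetical single-writer edit grant with an attributed audit trail."""

    def __init__(self, first_holder: str) -> None:
        self._lock = threading.Lock()  # serializes mutations and transfers
        self._holder = first_holder
        self._seq = 0
        self.audit: list = []          # (seq, operator, action), in applied order

    def _record(self, operator: str, action: str) -> None:
        self._seq += 1
        self.audit.append((self._seq, operator, action))

    def mutate(self, operator: str, action: str) -> None:
        with self._lock:
            if operator != self._holder:
                raise PermissionError(f"{operator} does not hold the edit grant")
            self._record(operator, action)

    def transfer(self, from_operator: str, to_operator: str) -> None:
        with self._lock:
            if from_operator != self._holder:
                raise PermissionError(f"{from_operator} cannot transfer the grant")
            self._record(from_operator, f"transfer grant to {to_operator}")
            self._holder = to_operator


grant = SteeringGrant("agent")
grant.mutate("agent", "docker build .")
grant.transfer("agent", "developer")
grant.mutate("developer", "edit Dockerfile")
try:
    grant.mutate("agent", "docker build .")  # the agent no longer holds the grant
    rejected = False
except PermissionError:
    rejected = True
grant.transfer("developer", "agent")
```

Observers without the grant could still read the session; only mutations pass through the lock, which is what keeps the audit trail totally ordered.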

Replay can pause a logical workflow for a month, but it cannot reconstruct a half‑filled REPL, an open TCP connection, or a running web server. CRIU can freeze that state, but it offers no way for a second actor to attach while the process is alive, nor does it provide a durable identity that outlives the host.

The execution‑state continuity model fills that gap: it keeps the OS‑level graph alive, gives it a stable ID, and lets a distributed set of operators reach it over the network. The graph itself remains single‑homed (only one host runs it at a time), but the access topology is distributed – exactly the pattern needed for collaborative AI‑human workflows.
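
The split between a single-homed graph and a distributed access topology can be sketched as a small registry that maps stable IDs to the one host currently running each graph. `SessionRegistry` and the host names are hypothetical, and a real control plane would keep this mapping in replicated storage.

```python
class SessionRegistry:
    """Hypothetical identity service: stable session ID -> current home host."""

    def __init__(self) -> None:
        self._home: dict = {}

    def register(self, session_id: str, host: str) -> None:
        if session_id in self._home:
            raise ValueError(f"{session_id} already has a home")
        self._home[session_id] = host  # exactly one home per session

    def resolve(self, session_id: str) -> str:
        """Any operator, anywhere, looks up the host by ID alone."""
        return self._home[session_id]

    def rehome(self, session_id: str, new_host: str) -> str:
        """Move the session to `new_host` and return the previous home."""
        previous = self._home[session_id]
        self._home[session_id] = new_host
        return previous


registry = SessionRegistry()
registry.register("s-1", "host-a")
before = registry.resolve("s-1")            # operators reach host-a
previous = registry.rehome("s-1", "host-b")  # host-a is drained or lost
after = registry.resolve("s-1")             # same ID, new home
```

Operators never hold a host address as the identity of the session, so a re-home changes where traffic goes without changing what they attach to.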


Historical perspective

| Era | Goal | What was achieved | What was missing |
| --- | --- | --- | --- |
| Sprite / Mosix (late 80s-90s) | Live process migration | Moved a running process across nodes | No durable identity; host crash killed the process |
| CRIU (2011) | Capture live state | Bit-accurate checkpoint/restore | No control plane, no multi-actor steering |
| Durable-execution engines (Temporal, Cadence, etc., ~2014-present) | Logical durability & identity | Stable workflow IDs, cheap idle cost, deterministic replay | No live OS state, cannot host interactive sessions |
| Execution-state continuity (cmdop, Warp, E2B, etc.) | Owner-less, multi-operator live graph | Persistent identity + live OS graph + concurrent steering | Still early; standards and ecosystem are forming |

The missing column is the synthesis of live state + durable identity + distributed operator access.


Trade‑offs and design considerations

When to choose replay

  • Pure business logic where side effects are well‑encapsulated in activities.
  • Need for massive horizontal scaling and cheap idle cost.
  • Determinism is enforceable (e.g., financial transactions, order processing).

When to choose live execution‑state steering

  • Interactive workloads: shells, REPLs, long‑running servers that must stay open across hand‑offs.
  • Scenarios where a human or an AI must inspect or modify the exact runtime environment.
  • Systems that require audit‑ready attribution of every mutation performed by any operator.

When CRIU‑style snapshots are sufficient

  • One‑off migration or debugging of a single process.
  • Environments where you can afford a brief downtime while the snapshot is restored.
  • No need for concurrent multi‑actor access.

Overhead comparison

| Aspect | Replay | Live graph | CRIU |
| --- | --- | --- | --- |
| CPU / memory during idle | Near-zero (scale-to-zero) | Host must keep process alive (baseline OS overhead) | Zero after snapshot, but restore incurs cost |
| Network traffic | Event log (compact) | Streaming of PTY / socket data, occasional checkpoint uploads | Transfer of whole image (potentially large) |
| Complexity | Determinism enforcement, activity SDK | Distributed control plane, identity service, checkpoint coordination | Kernel-level ptrace, compatibility constraints |
| Failure mode | Re-run from log – always recovers logical state | If checkpoint exists, re-home; otherwise rebuild from minimal session state (partial loss) | Restore fails if host mismatch – may need full restart |

A concrete reference implementation: cmdop

The open‑source project cmdop (see its GitHub repo) demonstrates the middle column:

  • It creates a single‑homed execution graph that persists across host restarts via periodic checkpoints.
  • Operators connect through a WebSocket‑based control plane that authenticates each action and records an audit trail.
  • The graph can be steered by humans, AI agents, or automated monitors simultaneously, with explicit grant‑based authority transfer.


Why the distinction matters for the coming "agentic era"

  1. Human‑in‑the‑loop – Agents will increasingly need to defer to experts. A developer must be able to attach to the exact live container the agent is using, edit a config file, and hand control back without losing the agent’s context.
  2. Regulatory audit – Industries such as finance or healthcare demand a tamper‑evident log of who changed what in a live system. Replay logs capture activity results but not the low‑level OS mutations that matter for compliance.
  3. Performance – Certain workloads (e.g., long‑running data pipelines that keep sockets open) cannot be cleanly expressed as a series of deterministic activities; keeping the socket alive avoids costly reconnections.
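
For the audit requirement in point 2, one common way to make a log tamper-evident is to chain entries by hash, so that changing any past entry invalidates everything after it. The sketch below is a generic illustration, not the format used by any product named in this article; `append_entry`, `verify`, and the entry fields are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, operator: str, action: str) -> str:
    payload = json.dumps(
        {"prev": prev_hash, "operator": operator, "action": action},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()


def append_entry(log: list, operator: str, action: str) -> None:
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"operator": operator, "action": action, "prev": prev,
                "hash": entry_hash(prev, operator, action)})


def verify(log: list) -> bool:
    """Recompute the chain and report whether every link still matches."""
    prev = GENESIS
    for entry in log:
        expected = entry_hash(prev, entry["operator"], entry["action"])
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True


log: list = []
append_entry(log, "agent", "write /etc/app/config.yaml")
append_entry(log, "developer", "restart app.service")
intact = verify(log)

log[0]["action"] = "read /etc/app/config.yaml"  # someone rewrites history
tampered_detected = not verify(log)
```

A chain like this shows that a log was altered, not who altered it; attribution still depends on authenticating each operator before an entry is appended.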

If we continue to force every durable workload into the replay model, we will either (a) abandon the ability to interact with live state, or (b) build brittle adapters that try to reconstruct OS state from logs – a path that leads to subtle bugs and data loss.


Bottom line

  • Replayed systems give you deterministic durability for pure orchestration.
  • Frozen systems (CRIU) give you a perfect snapshot of a single process but no shared identity or multi‑actor control.
  • Steered systems expose a live OS execution graph as a first‑class, addressable object, enabling concurrent human‑AI steering while preserving a durable identity.

The industry is beginning to converge on the third model, but it still lacks a widely accepted name and a mature ecosystem. Recognizing the distinction now helps architects choose the right tool for the job and avoid forcing an ill‑suited durability model onto interactive, agent‑centric workloads.


What's next?

The next installment in the series will explore "AI as Operator, Not Controller: The Multi‑Actor Execution Model", showing how to design APIs and permission systems that let an LLM and a human share the same live execution graph safely.

