Azure's Long-Running Operation Protocol, and Why Async Control Plane Design Differs Across Clouds

Cloud Reporter

Microsoft's engineering team published a clear walkthrough of how Azure Resource Manager tracks operations that outlive a single HTTP request. For teams running multi-cloud control planes, the polling contract behind every `az aks create` is worth understanding, because AWS, Google Cloud, and Azure each solved the same async problem differently, and those differences shape how you build automation.

A new post from Microsoft's Apps on Azure team, authored by Arav Goyal and colleagues, documents something most engineers use daily without thinking about: the protocol Azure Resource Manager (ARM) uses to track work that cannot finish inside a single HTTP request. The mechanics sound mundane until you write cross-cloud automation and discover that every provider answers the same question (how does a client wait for a multi-minute operation?) with a different contract.

What Changed

The original post on Microsoft Community Hub is documentation rather than a product release, but it formalizes the long-running operation (LRO) contract that Azure Resource Manager has implemented for years. Every control plane request, whether it originates from the Portal, the Azure CLI, PowerShell, an SDK, or a raw REST call to management.azure.com, flows through ARM. ARM authenticates it, authorizes it against role assignments and policy, and forwards it to the resource provider that owns the resource type. Microsoft.Compute owns virtual machines, Microsoft.ContainerService owns managed Kubernetes, and so on.

When an operation will take longer than a synchronous response can accommodate, the resource provider returns an HTTP 201 Created or 202 Accepted along with one or both of two headers:

  • Azure-AsyncOperation points to a status URL. Polling it returns a structured body with a status field that moves from InProgress to a terminal Succeeded, Failed, or Canceled. Failures carry a structured error object.
  • Location points to a URL that returns 202 while work is in flight and 200 OK with the final payload once complete: the resource itself for a PUT, or the action result for a POST such as start or restart.

Responses in either model may carry a Retry-After header telling the client how many seconds to wait before polling again. The guidance is explicit: prefer Azure-AsyncOperation when present, because a structured status response tells you more than the implicit "still 202" signal from polling Location alone. Many resources also expose a provisioningState property on the resource itself, giving clients a secondary completion signal when they happen to be reading the resource for other reasons. The async operation URL remains the authoritative source.
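The polling rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Azure SDK code: `fetch` is a hypothetical callable that performs an authenticated GET and returns `(status_code, headers, body)`, the 5-second default is an arbitrary choice for when Retry-After is absent, and only the seconds form of Retry-After is parsed.

```python
import time
from typing import Callable, Mapping, Optional, Tuple

TERMINAL_STATES = {"Succeeded", "Failed", "Canceled"}

# Hypothetical transport: GET a URL, return (status_code, headers, json_body).
Fetch = Callable[[str], Tuple[int, Mapping[str, str], dict]]


def _lower(headers: Mapping[str, str]) -> dict:
    """HTTP header names are case-insensitive, so normalize before lookup."""
    return {k.lower(): v for k, v in headers.items()}


def pick_poll_target(headers: Mapping[str, str]) -> Optional[Tuple[str, str]]:
    """Prefer Azure-AsyncOperation over Location, as the guidance recommends."""
    h = _lower(headers)
    if "azure-asyncoperation" in h:
        return ("async-operation", h["azure-asyncoperation"])
    if "location" in h:
        return ("location", h["location"])
    return None


def retry_after_seconds(headers: Mapping[str, str], default: float = 5.0) -> float:
    """Read Retry-After as seconds; fall back to an assumed default otherwise."""
    raw = _lower(headers).get("retry-after")
    try:
        return max(0.0, float(raw))
    except (TypeError, ValueError):
        return default


def poll_until_done(kind: str, url: str, fetch: Fetch,
                    sleep: Callable[[float], None] = time.sleep,
                    max_polls: int = 100) -> dict:
    """Poll until the operation is terminal; the caller inspects the result."""
    for _ in range(max_polls):
        status_code, headers, body = fetch(url)
        if kind == "async-operation":
            # Structured status: InProgress until Succeeded, Failed, or Canceled.
            if body.get("status") in TERMINAL_STATES:
                return body
        elif status_code == 200:
            # Location model: 202 means still running, 200 carries the payload.
            return body
        sleep(retry_after_seconds(headers))
    raise TimeoutError(f"gave up on {url} after {max_polls} polls")
```

A real client would also handle error status codes and authentication; the sketch leaves those out to keep the two completion signals visible.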

The reasons ARM cannot simply hold a connection open are the same reasons every cloud faces this problem. Intermediate proxies and load balancers time out long-lived connections, clients go offline mid-wait, and pinning a TCP connection open for a multi-hour cluster provision burns server resources for no useful work, since the actual provisioning happens elsewhere.

Provider Comparison

This is where the post becomes useful beyond Azure. The async control plane problem is universal, but the three major providers expose it through meaningfully different contracts, and that matters when you are standardizing tooling across clouds.

Azure uses the header-driven model described above: a 202, an Azure-AsyncOperation or Location URL, Retry-After pacing, and terminal states surfaced both at the operation level and through provisioningState. The client polls. There is no first-class operation resource you list and query independently of the headers you were handed, though the async operation URL is itself a queryable endpoint.

Google Cloud takes the most explicit approach. Most mutating calls return a first-class Operation resource with its own name, and you poll operations.get or call operations.wait until the done field flips to true. The Operation is a durable, addressable object you can query, list, and reason about independently of the request that created it. For automation that needs to reconnect to an in-flight operation after a process restart, this model is the cleanest, because the operation has a stable identity rather than living inside a header from a response you may no longer have.
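As a rough sketch of why a named operation is convenient, the loop below needs nothing but the operation's name to resume waiting. It assumes the common Operation shape with `done`, `error`, and `response` fields; `get_operation` is a hypothetical stand-in for whatever client call fetches the operation by name, and the fixed 2-second interval is an arbitrary choice. Some Google APIs use a different operation shape, so treat the field names as assumptions.

```python
import time
from typing import Callable


def wait_for_operation(name: str,
                       get_operation: Callable[[str], dict],
                       sleep: Callable[[float], None] = time.sleep,
                       interval: float = 2.0,
                       max_polls: int = 100) -> dict:
    """Poll an operation by name until its `done` field is true."""
    for _ in range(max_polls):
        op = get_operation(name)
        if op.get("done"):
            # A finished operation carries either an error or a response.
            if "error" in op:
                raise RuntimeError(f"operation {name} failed: {op['error']}")
            return op.get("response", {})
        sleep(interval)
    raise TimeoutError(f"operation {name} not done after {max_polls} polls")
```

Because the only input is `name`, a restarted process can call the same function with a name it stored earlier, without needing the original HTTP response.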

AWS generally does not use a uniform LRO envelope at all. Behavior is per-service. CloudFormation exposes stack status fields you poll with describe-stacks. EKS cluster creation returns a status you poll with describe-cluster. EC2 instance state lives in describe-instances. The SDKs paper over this with waiters: client-side polling loops with built-in pacing and attempt limits that abstract away the per-service status field. The trade-off is consistency: there is no single protocol to learn, but also no single contract to rely on, so cross-service automation tends to lean on the SDK waiter rather than a documented wire-level guarantee.
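To make the waiter idea concrete, here is a hand-rolled sketch of the pattern. It is not the SDK's implementation: `describe` and `extract_status` are hypothetical callables the caller supplies for each service, and the delay and attempt defaults are illustrative.

```python
import time
from typing import Any, Callable, Collection


def wait_for_status(describe: Callable[[], Any],
                    extract_status: Callable[[Any], str],
                    success: Collection[str],
                    failure: Collection[str] = (),
                    delay: float = 30.0,
                    max_attempts: int = 40,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Call `describe` repeatedly until the extracted status is terminal."""
    for attempt in range(max_attempts):
        status = extract_status(describe())
        if status in success:
            return status
        if status in failure:
            raise RuntimeError(f"resource entered failure state {status}")
        if attempt < max_attempts - 1:
            sleep(delay)  # no need to sleep after the final attempt
    raise TimeoutError(f"no terminal status after {max_attempts} attempts")


# Per-service wiring is where the differences live. With boto3 it might look
# like this (shown as comments so the block runs without AWS credentials):
#   eks = boto3.client("eks")
#   wait_for_status(lambda: eks.describe_cluster(name="demo"),
#                   lambda r: r["cluster"]["status"],
#                   success={"ACTIVE"}, failure={"FAILED"})
```

The loop itself is generic; everything service-specific sits in the two callables and the status sets, which is the per-service knowledge a waiter encapsulates.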

The practical distinction comes down to where the operation's identity lives. Google gives you a durable operation object. Azure gives you a status URL and a provisioningState you can fall back to. AWS gives you a per-service status field and an SDK abstraction over it. None is wrong, but they are not interchangeable, and a team that assumes Azure's header model will translate directly to AWS or GCP will write brittle code.

Business Impact

For organizations running real multi-cloud estates, three points carry weight.

First, polling cadence is an operational cost, not a detail. The Microsoft post closes by flagging this directly: when no Retry-After is supplied, the client picks its own interval. Across a platform where some operations finish in eight seconds and others run for hours, that interval determines how much effort goes into useful status retrieval versus repeatedly asking an unfinished operation whether it is done. The same tension exists on every cloud. Aggressive polling triggers throttling, and a throttled client learns of completion no sooner than a patient one. When you build orchestration that fans out hundreds of concurrent deployments, honoring Retry-After (and implementing sane backoff where it is absent) is the difference between predictable completion times and self-inflicted rate limiting.
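One way to express "honor Retry-After, back off otherwise" is a small delay function. This is a sketch, not a recommendation from the Microsoft post: the 2-second base, 60-second cap, and the jitter scheme (half fixed, half random) are assumptions chosen for illustration.

```python
import random
from typing import Callable, Optional


def next_delay(attempt: int,
               retry_after: Optional[float] = None,
               base: float = 2.0,
               cap: float = 60.0,
               rng: Callable[[], float] = random.random) -> float:
    """Return how long to wait before poll number `attempt + 1` (0-based)."""
    if retry_after is not None:
        # The server stated its preference, so use it as-is.
        return max(0.0, retry_after)
    # Exponential growth, capped so multi-hour operations are still checked.
    ceiling = min(cap, base * (2 ** attempt))
    # Jitter spreads out clients that started at the same moment.
    half = ceiling / 2
    return half + rng() * half
```

With hundreds of concurrent deployments, the jitter term matters as much as the backoff: without it, clients launched together tend to poll together.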

Second, abstraction layers hide these differences until they fail. Terraform, Pulumi, Crossplane, and the cloud SDKs all implement the per-provider polling logic for you. That is exactly why teams forget the contracts exist, right up until a long-running operation times out in CI, a provisioning state hangs in a transitional value, or a reconnect-after-restart scenario exposes that Azure's status URL was never persisted. Understanding the underlying protocol is what lets you debug the abstraction when it leaks.
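The reconnect-after-restart failure is avoidable if the status URL is written somewhere durable before polling begins. The sketch below uses a local JSON file purely for illustration; the file layout, function names, and operation IDs are all hypothetical, and a real orchestrator would more likely use its existing state store.

```python
import json
from pathlib import Path
from typing import Dict


def load_pending(path: Path) -> Dict[str, str]:
    """Return {operation_id: poll_url} for operations still in flight."""
    if not path.exists():
        return {}
    return json.loads(path.read_text())


def _write(path: Path, state: Dict[str, str]) -> None:
    # Write to a temp file and rename so a crash cannot leave a half-written file.
    tmp = Path(str(path) + ".tmp")
    tmp.write_text(json.dumps(state, indent=2))
    tmp.replace(path)


def remember(path: Path, operation_id: str, poll_url: str) -> None:
    """Record the status URL as soon as the 201/202 response arrives."""
    state = load_pending(path)
    state[operation_id] = poll_url
    _write(path, state)


def forget(path: Path, operation_id: str) -> None:
    """Drop the entry once the operation reaches a terminal state."""
    state = load_pending(path)
    state.pop(operation_id, None)
    _write(path, state)
```

On startup, a process would call `load_pending` and resume polling each stored URL instead of reissuing the original request.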

Third, migration and tooling decisions should account for control plane ergonomics. If your platform team is standardizing on a single internal deployment API across clouds, Google's durable Operation resource is the easiest to model uniformly, Azure's status URL plus provisioningState gives you two observation points to reconcile, and AWS's per-service approach pushes you toward SDK waiters rather than a wire contract. These are not reasons to pick one cloud over another, but they are real engineering costs that belong in the estimate when you scope multi-cloud automation.
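A single internal deployment API usually starts with a shared vocabulary for operation state. The sketch below shows one possible normalization; `OpState` and the three adapter functions are hypothetical names, and treating any unrecognized Azure status as still running is an assumption a real platform might want to tighten.

```python
from enum import Enum
from typing import Collection


class OpState(Enum):
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELED = "canceled"


_AZURE_TERMINAL = {
    "Succeeded": OpState.SUCCEEDED,
    "Failed": OpState.FAILED,
    "Canceled": OpState.CANCELED,
}


def from_azure(body: dict) -> OpState:
    """Map an Azure-AsyncOperation status body; non-terminal means running."""
    return _AZURE_TERMINAL.get(body.get("status"), OpState.RUNNING)


def from_google(operation: dict) -> OpState:
    """Map an Operation resource using its `done` and `error` fields."""
    if not operation.get("done"):
        return OpState.RUNNING
    return OpState.FAILED if "error" in operation else OpState.SUCCEEDED


def from_aws(status: str, success: Collection[str],
             failure: Collection[str]) -> OpState:
    """Map a per-service status; the caller supplies the terminal sets."""
    if status in success:
        return OpState.SUCCEEDED
    if status in failure:
        return OpState.FAILED
    return OpState.RUNNING
```

The signatures make the ergonomic gap visible: the Azure and Google adapters take one argument, while the AWS adapter needs the caller to say which status strings count as terminal for the service in question.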

The long-running operation protocol is infrastructure that is invisible when it works. A user runs a command, waits, and sees a result. Underneath sits a well-defined contract: a 201 or 202, a status URL, headers that say where and how often to check, a predictable set of terminal states, and an optional second signal. Azure's version of that contract is now documented cleanly enough to read in a single sitting, and reading it alongside the AWS and Google equivalents is a worthwhile exercise for anyone whose automation has to speak to more than one cloud. The contract you assume is universal is the one that breaks your pipeline first.
