Azure Container Registry (ACR) uses a stamp‑based architecture and proactive rebalancing to keep tenant workloads isolated and performant. By monitoring hot‑node CPU metrics and other signals, engineers move registries between stamps or provide dedicated stamp isolation, all without changing customer endpoints. This article explains the stamp concept, the rebalancing workflow, the metrics that trigger moves, and the business impact for both large AI workloads and everyday users.

How Azure Container Registry Guarantees Predictable Multi‑Tenant Performance at Scale

What changed?

Azure Container Registry (ACR) has always been a multi‑tenant service, but recent operational enhancements make its performance guarantees far more reliable. Microsoft’s engineering team now proactively rebalances registries between ACR stamps whenever a stamp shows sustained hot‑node CPU pressure, and it can provision dedicated stamp isolation for exceptionally large workloads. These practices are still operator‑driven, but they are on a clear roadmap toward automation.

Provider comparison: ACR vs. other container registries

Feature	Azure Container Registry (ACR)	Amazon Elastic Container Registry (ECR)	Google Artifact Registry (GAR)
Multi‑tenant isolation model	Stamp architecture (capacity pool, fault domain, update domain) with active rebalancing	Region‑wide shared clusters; isolation via VPC endpoints and throttling limits	Multi‑regional clusters; isolation via project‑level quotas
Visibility of rebalancing	Transparent to customers; endpoint never changes	No documented customer‑visible rebalancing; the service is fully managed, so customers do not scale registry nodes	No documented customer‑visible rebalancing; the service is fully managed
Proactive metrics	Hot‑node P95 CPU, throttling rates, latency tails	CloudWatch alarms on CPU/Network, but no automated tenant moves	Monitoring via Cloud Monitoring, manual scaling actions
Isolation options for large tenants	Optional dedicated stamp (fewer co‑tenants, independent fault & update domains)	Private link + dedicated VPC, but still shares underlying hardware	Separate regional bucket, but still shares compute resources
Automation roadmap	Automated stamp provisioning, lifecycle management, AI‑driven signal weighting	Ongoing improvements to auto‑scaling policies	Planned auto‑scaling for high‑traffic artifacts

Why the stamp model matters: A stamp is a self‑contained deployment unit inside a region, consisting of VM Scale Set (VMSS) compute pools for registry logic and dataproxy, plus a pool of Azure Storage accounts for blob data. Each stamp acts simultaneously as a capacity pool, a fault domain, and an update domain. When a registry moves from one stamp to another, all three aspects change together, but the DNS‑resolved endpoint stays the same, making the migration invisible to the client.
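To make the stamp concept concrete, here is a minimal Python sketch of the relationship between a stamp and a registry. The class names, field names other than home_stamp, the node counts, and the example login server are all illustrative assumptions, not ACR's internal schema. The point it shows is that a move changes which stamp serves the registry while the customer-facing endpoint stays the same.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Stamp:
    """Hypothetical model of a stamp: two compute pools plus storage accounts."""
    name: str
    region: str
    registry_pool_nodes: int    # VMSS pool running registry logic
    dataproxy_pool_nodes: int   # VMSS pool running dataproxy
    storage_accounts: List[str] = field(default_factory=list)


@dataclass
class Registry:
    login_server: str  # customer-facing endpoint; unchanged by a move
    home_stamp: str    # name of the stamp currently serving the registry


def move_registry(registry: Registry, destination: Stamp) -> Registry:
    """Re-home a registry. Capacity pool, fault domain, and update domain
    all change together because they are properties of the stamp."""
    registry.home_stamp = destination.name
    return registry


# Illustrative values only.
stamp_a = Stamp("stamp-a", "eastus", 12, 20, ["sa001", "sa002"])
stamp_b = Stamp("stamp-b", "eastus", 8, 10, ["sa101"])

reg = Registry("contoso.azurecr.io", stamp_a.name)
endpoint_before = reg.login_server
move_registry(reg, stamp_b)
print(reg.home_stamp, reg.login_server == endpoint_before)
```

Modeling the endpoint and the home stamp as separate fields mirrors why clients never notice a migration: only the second field is touched.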

How rebalancing works

1. Signal collection – ACR continuously ingests telemetry from each VMSS node. The primary proactive signal is hot‑node P95 CPU:
   - For every 1‑minute interval, compute the average CPU of the busiest node.
   - Across a 12‑hour peak window, take the 95th percentile of those per‑minute values.
   - This yields a metric that reflects sustained hot‑spot pressure while filtering out momentary spikes.
2. Threshold evaluation – When hot‑node P95 exceeds a configurable threshold (e.g., 45 %), or when reactive signals such as sustained throttling or error bursts appear, the rebalancing workflow is triggered.
3. Registry selection – Engineers identify the registries contributing most to the load, prioritizing those with the highest traffic or the most volatile latency profiles.
4. Destination stamp choice – A less‑utilized stamp in the same Azure region is selected. If moving a registry would simply shift the hot‑spot to another shared stamp, the team may opt for additional stamp isolation, provisioning a stamp with fewer co‑tenants.
5. Cut‑over – The control plane updates the registry’s home_stamp field. DNS routing follows automatically; in‑flight requests on the source stamp drain within 30–60 seconds, and new traffic lands on the destination stamp within minutes. No endpoint change is required.
6. Post‑move validation – Latency and throughput are monitored for a 24‑ to 48‑hour window to confirm the expected improvement before proceeding with the next batch.
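The hot‑node P95 signal and the threshold check from the steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not ACR's telemetry pipeline: the function names are hypothetical, the input is assumed to be one average CPU value per node per minute, the percentile uses the nearest-rank definition (the article does not say which definition ACR uses), and the sample data is made up and much shorter than a real 12‑hour window.

```python
import math
from typing import Dict, List, Tuple


def hot_node_series(cpu_by_node: Dict[str, List[float]]) -> List[float]:
    """Per 1-minute interval, keep the average CPU of the busiest node."""
    return [max(minute) for minute in zip(*cpu_by_node.values())]


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile; other definitions interpolate instead."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def hot_node_p95(cpu_by_node: Dict[str, List[float]],
                 threshold: float = 45.0) -> Tuple[float, bool]:
    """Return the hot-node P95 and whether it exceeds the threshold."""
    p95 = percentile(hot_node_series(cpu_by_node), 95)
    return p95, p95 > threshold


# Made-up data: 20 one-minute samples per node.
# A real 12-hour window would hold 720 samples per node.
samples = {
    "node-a": [30.0] * 18 + [90.0, 50.0],  # one brief spike to 90 %
    "node-b": [40.0] * 20,
    "node-c": [20.0] * 20,
}

p95, breached = hot_node_p95(samples)
print(p95, breached)  # 50.0 True
```

In this sample the single 90 % minute does not set the metric; the P95 lands on 50 %, which is the behavior the article describes as filtering out momentary spikes while still flagging sustained pressure.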

Example: AI workload isolation

A large AI customer owned 40 registries across two shared stamps. Four registries generated 96.7 % of the traffic. The engineering team performed a phased migration:

Phase 1 – 30 low‑traffic registries moved to validate tooling.
Phase 2 – Medium‑traffic registries moved in sub‑batches, each observed for 24 hours.
Phase 3 – The four high‑traffic registries moved one at a time, each observed for 48 hours.
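A phased plan like this one can be derived from per-registry traffic shares. The sketch below is hypothetical: the cutoff values, the function name, and the per-registry request counts are invented for illustration, chosen only so that the totals match the article's figures (40 registries, 30 in the first phase, four carrying 96.7 % of traffic). The count of six medium‑traffic registries is inferred from those totals rather than stated in the article.

```python
from typing import Dict, List


def plan_phases(traffic: Dict[str, int],
                low_cutoff: float = 0.001,
                high_cutoff: float = 0.05) -> Dict[str, List[str]]:
    """Group registries into migration phases by share of total traffic.

    The cutoffs are illustrative assumptions, not values used by ACR.
    """
    total = sum(traffic.values())
    phases: Dict[str, List[str]] = {
        "phase1_low": [], "phase2_medium": [], "phase3_high": [],
    }
    # Visit registries from quietest to busiest so each phase is ordered.
    for name, requests in sorted(traffic.items(), key=lambda kv: kv[1]):
        share = requests / total
        if share < low_cutoff:
            phases["phase1_low"].append(name)
        elif share < high_cutoff:
            phases["phase2_medium"].append(name)
        else:
            phases["phase3_high"].append(name)
    return phases


# Invented request counts that sum to 10,000.
traffic = {f"low-{i:02d}": 3 for i in range(30)}
traffic.update({f"mid-{i}": 40 for i in range(6)})
traffic.update({"high-0": 3000, "high-1": 2500, "high-2": 2170, "high-3": 2000})

phases = plan_phases(traffic)
high_share = sum(traffic[n] for n in phases["phase3_high"]) / sum(traffic.values())
print({k: len(v) for k, v in phases.items()}, round(high_share, 3))
```

Moving the quiet registries first keeps the blast radius of any tooling defect small, which is the rationale the article gives for Phase 1.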

Resulting metrics

Stamp	Registry pool hot‑node P95, relative change (absolute change)	Dataproxy pool hot‑node P95, relative change (absolute change)
A (source)	–7 % (flat)	–34 % (96 % → 64 %)
B (source)	–33 % (–3 pp)	–44 % (–5 pp)

All other tenants on those stamps experienced lower tail latency and more headroom, even though they never saw a migration notice.

Business impact

Predictable pull performance – By keeping hot‑node CPU under control, ACR reduces the variance in image pull latency, which directly improves CI/CD pipeline speed and application start‑up times.
Reduced blast radius – Isolating a noisy tenant prevents its traffic spikes from affecting unrelated customers, lowering the risk of SLA breaches.
Cost efficiency – After a stamp is relieved of excess load, the VMSS minimum instance count can be reduced, which lowers Microsoft’s operational spend and may ease upward pressure on customer pricing.
Customer confidence – The transparent nature of the move means developers never need to change scripts, CI pipelines, or endpoint configurations, preserving productivity.

Migration considerations for customers

Consideration	Guidance
Geo‑replication	Each regional replica is bound to a single stamp. Ensure your DNS TTL is set to a low value (e.g., 60 seconds) if you rely on custom DNS caching, although ACR’s internal Traffic Manager handles most routing.
Private endpoints	Dataproxy traffic that flows through private endpoints follows the same stamp migration path. No re‑configuration is needed; the private link remains valid because the underlying storage account does not move.
Performance monitoring	Use Azure Monitor metrics `acr_dataproxy_cpu_percent` and `acr_registry_cpu_percent` to watch hot‑node trends. Set alerts near the 45 % P95 threshold to be aware of upcoming rebalances.
Isolation requests	If you anticipate sustained high traffic (e.g., large model training pipelines), open a support ticket requesting dedicated stamp isolation. Provide expected peak QPS and data volume to help the engineering team size the stamp.

Roadmap to automation

While today’s rebalancing is manual, Microsoft has announced three automation milestones:

Signal weighting engine – An AI model will combine reactive and proactive signals to score stamps in real time.
Automated stamp provisioning – When a score exceeds a threshold, a new stamp will be spun up automatically in the same region.
Self‑service isolation portal – Customers will be able to request dedicated stamp isolation via the Azure portal, with instant provisioning based on the signal engine’s recommendation.
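The article does not describe how the signal weighting engine will work internally. Purely as a stand-in, the sketch below uses a plain weighted sum (not an AI model) to show the general idea of combining one proactive signal with two reactive ones and comparing the result to a provisioning threshold. Every name, weight, and threshold here is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class StampSignals:
    hot_node_p95: float   # proactive signal, percent (0-100)
    throttle_rate: float  # reactive signal, fraction of requests throttled (0-1)
    error_rate: float     # reactive signal, fraction of requests failing (0-1)


# Hypothetical weights; a learned model would replace this table.
WEIGHTS = {"hot_node_p95": 0.6, "throttle_rate": 0.25, "error_rate": 0.15}


def score(signals: StampSignals) -> float:
    """Combine the signals into a single pressure score between 0 and 1."""
    cpu = min(max(signals.hot_node_p95 / 100.0, 0.0), 1.0)
    throttle = min(max(signals.throttle_rate, 0.0), 1.0)
    errors = min(max(signals.error_rate, 0.0), 1.0)
    return (WEIGHTS["hot_node_p95"] * cpu
            + WEIGHTS["throttle_rate"] * throttle
            + WEIGHTS["error_rate"] * errors)


def needs_new_stamp(signals: StampSignals, threshold: float = 0.3) -> bool:
    """True when the score exceeds the (assumed) provisioning threshold."""
    return score(signals) > threshold


busy = StampSignals(hot_node_p95=64.0, throttle_rate=0.10, error_rate=0.02)
quiet = StampSignals(hot_node_p95=20.0, throttle_rate=0.0, error_rate=0.0)
print(needs_new_stamp(busy), needs_new_stamp(quiet))  # True False
```

A single scalar score makes the trigger for automated provisioning easy to reason about, at the cost of hiding which signal drove the decision; a production design would likely log the individual terms as well.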

Bottom line

Azure Container Registry’s stamp architecture, combined with proactive hot‑node CPU monitoring and operator‑driven rebalancing, delivers consistent, predictable performance for all tenants, from small dev teams to massive AI workloads. The practice is invisible to end users, yet it improves latency, reduces risk, and can lower costs. As Microsoft progresses toward full automation, customers can expect even faster response times to load shifts and a more transparent path to dedicated isolation when needed.

For deeper technical details, see the official ACR documentation on stamp architecture and the hot‑node CPU metric guide.

