Running AI Workloads on Azure Container Apps: What Breaks at Startup and How to Fix It

Microsoft's Container Apps team published a field guide to the failures that hit teams the moment they move from web APIs to model servers: probe timeouts, OOM kills, silent GPU fallbacks, and LangChain startup stalls. The fixes are practical, but they also expose a strategic question about whether serverless containers are the right home for inference at all.

A new entry in Microsoft's Troubleshooting Azure Container Apps in Production series tackles the part of the platform story that vendor marketing usually skips: what actually goes wrong when you put a PyTorch model server, an ONNX inference service, or a LangChain agent inside Azure Container Apps. The short version is that the assumptions baked into a serverless container platform (fast startup, predictable memory, linear CPU scaling) are exactly the assumptions an AI workload violates. The article walks through four startup failures and the fixes for each. Below is what the guidance covers, how the platform compares to the alternatives, and what it means for teams deciding where inference should live.

What the guidance covers

The core message is a reframing. A Django REST API starts in seconds and holds a steady memory footprint. A large language model, a vision model, or an embedding pipeline can take minutes to load, consume gigabytes before serving a single request, and fail in ways that look nothing like a normal out-of-memory error. Container Apps does not change its behavior to accommodate that profile, so the operator has to.

Probe timeouts during model load. The most common failure is a restart loop. The container starts, begins loading weights layer by layer, and is then killed and restarted because the liveness probe fired and timed out before loading finished. The fix is to separate the startup probe from the liveness probe. A startup probe suppresses the liveness check while it runs, so a generous failureThreshold and periodSeconds (for example, 40 attempts at 15 seconds each, or 600 seconds, giving a 10-minute window) buy the model time to load. The companion pattern is a three-tier health endpoint: startup returns 200 immediately, liveness reports process health, and readiness returns 503 until model_loaded flips true. The model itself loads in a background task so the HTTP server can answer probes right away. See the health probes documentation for the full schema.
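The three-tier endpoint and background load can be sketched with nothing but the Python standard library. This is a minimal illustration, not the guide's code: the paths /startup, /live and /ready and the names ModelState, probe_status and make_handler are assumptions made up for the example, and load_fn stands in for whatever actually loads your weights.

```python
import threading
from http.server import BaseHTTPRequestHandler


class ModelState:
    """Shared state: the event is set once the background load finishes."""

    def __init__(self):
        self.model_loaded = threading.Event()
        self.model = None


def start_background_load(state, load_fn):
    """Run the slow load in a thread so the HTTP server can answer probes."""

    def _load():
        state.model = load_fn()      # hypothetical loader supplied by the caller
        state.model_loaded.set()     # readiness flips only after loading succeeds

    thread = threading.Thread(target=_load, daemon=True)
    thread.start()
    return thread


def probe_status(path, state):
    """Map a probe path (hypothetical paths) to an HTTP status code."""
    if path == "/startup":
        return 200                   # the process is up; do not wait for weights
    if path == "/live":
        return 200                   # process health only, independent of the model
    if path == "/ready":
        return 200 if state.model_loaded.is_set() else 503
    return 404


def startup_window_seconds(failure_threshold, period_seconds):
    """Time a startup probe allows before the container is restarted."""
    return failure_threshold * period_seconds


def make_handler(state):
    """Build a request handler bound to the shared state."""

    class ProbeHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(probe_status(self.path, state))
            self.end_headers()

    return ProbeHandler
```

To serve it, something like ThreadingHTTPServer(("0.0.0.0", 8000), make_handler(state)).serve_forever() would do; in practice most teams would express the same three routes in their existing web framework. startup_window_seconds(40, 15) gives the 600 seconds quoted above.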

OOM kills under inference load. Exit code 137 (128 plus signal 9) is the signature here: the kernel OOM killer sent SIGKILL because the container crossed its memory limit. The trap is that ML memory is not constant. A model that sits at 4 GB at rest can spike to 7 GB during inference when batch sizes grow or the KV cache warms up. With the non-GPU ceiling at 4 vCPU and 8 Gi, that headroom disappears fast. The recommended mitigations stack: quantize the model (4-bit quantization via bitsandbytes takes a 7B model from roughly 14 GB to 4 GB), batch requests to flatten per-request spikes, set explicit memory limits matched to peak usage rather than resting usage, and have the readiness probe shed traffic when psutil reports memory above 90 percent.
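The quantization arithmetic and the 90 percent shedding rule are simple enough to write down. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the estimate covers weights only (activations, KV cache and runtime overhead come on top, which is why 3.5 GB of 4-bit weights lands at roughly 4 GB in practice), and the memory percentage is passed in rather than read inside the function so the logic can be exercised without psutil installed.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


def readiness_status(memory_percent, model_loaded, threshold=90.0):
    """Return the HTTP status a readiness probe should answer with.

    memory_percent would typically come from psutil.virtual_memory().percent.
    Inside a container that figure may reflect the host rather than the
    container's own limit, so treat the source of the number as an assumption
    to verify in your environment.
    """
    if not model_loaded:
        return 503                       # still loading: take no traffic yet
    if memory_percent > threshold:
        return 503                       # shed traffic before the OOM killer acts
    return 200


fp16_gb = weight_memory_gb(7e9, 16)      # 7B parameters at 16 bits each
int4_gb = weight_memory_gb(7e9, 4)       # the same model at 4 bits each
print(f"fp16 weights: {fp16_gb:.1f} GB, 4-bit weights: {int4_gb:.1f} GB")
```

Returning 503 from readiness only removes the replica from rotation; it does not restart it, which gives memory a chance to drop back before the process is killed.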

Silent GPU fallback. This is the most expensive failure because it does not announce itself. Deploy a CUDA container to a CPU-only environment and torch.cuda.is_available() quietly returns False, the model falls back to CPU, and inference can run on the order of 50x slower while everything looks healthy. The fix is partly discipline (log the selected device and CUDA version at startup, never assume) and partly configuration. GPU work requires a GPU-enabled workload profile in the environment, created with something like --workload-profile-type "NC24-A100". The CUDA toolkit version in your image must be no newer than the version the host driver supports, which is why the guide steers toward the official pytorch/pytorch images that bundle compatible CUDA builds.
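A startup check along these lines makes the fallback loud instead of silent. It is a sketch, not the guide's code: the function names and the require_gpu switch are assumptions, and refusing to start when a GPU is required but missing goes one step beyond the logging the guide recommends. The torch import is deferred so the decision logic can be tested on a machine without PyTorch.

```python
import logging

logger = logging.getLogger("startup")


def select_device(cuda_available, require_gpu):
    """Pick the device, refusing a silent CPU fallback when a GPU is required."""
    if cuda_available:
        return "cuda"
    if require_gpu:
        raise RuntimeError(
            "GPU required but CUDA is not available; refusing to fall back to CPU"
        )
    return "cpu"


def log_device_at_startup(require_gpu=True):
    """Log the chosen device and CUDA build once, at process start."""
    import torch  # deferred import: only needed when this function is called

    device = select_device(torch.cuda.is_available(), require_gpu)
    logger.info(
        "device=%s torch=%s cuda_build=%s",
        device,
        torch.__version__,
        torch.version.cuda,  # None on CPU-only builds of PyTorch
    )
    return device
```

Failing at startup turns a slow, healthy-looking replica into a visible crash that the platform's own logs and alerts will surface.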

LangChain and RAG startup stalls. Agent and retrieval pipelines often do heavy work at boot, embedding thousands of documents or building an in-memory vector index, and much of that work is synchronous, blocking the thread that should be answering probes. The fix mirrors the model-loading pattern: push initialization into a background task with run_in_executor so CPU-bound work stays off the event loop. The more strategic recommendation is to stop rebuilding indexes entirely. An in-memory FAISS store is lost on every restart and adds minutes to startup. Moving to Azure AI Search makes the index persistent and reduces startup to a connection handshake.
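The run_in_executor pattern looks roughly like this. It is a minimal sketch using only asyncio from the standard library: build_index is a stand-in for the real embedding or indexing work, and RetrievalService is a hypothetical name, not a LangChain class.

```python
import asyncio
import time


def build_index(documents):
    """Stand-in for blocking work such as embedding documents at boot."""
    time.sleep(0.2)                          # simulates the slow, synchronous part
    return {doc: len(doc) for doc in documents}


class RetrievalService:
    """Hypothetical service whose index is built off the event loop."""

    def __init__(self):
        self.index = None

    @property
    def ready(self):
        return self.index is not None

    async def initialize(self, documents):
        loop = asyncio.get_running_loop()
        # None selects the loop's default thread pool executor.
        self.index = await loop.run_in_executor(None, build_index, documents)


async def main():
    service = RetrievalService()
    task = asyncio.create_task(service.initialize(["alpha", "beta"]))
    await asyncio.sleep(0.05)                # the loop stays free to answer probes
    ready_during_build = service.ready       # index is still being built here
    await task
    return ready_during_build, service.ready


if __name__ == "__main__":
    print(asyncio.run(main()))
```

A readiness probe would report service.ready, so the replica takes traffic only once the index exists. For heavy pure-Python work, passing a process pool instead of the default thread pool is one option to consider.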

Provider comparison: where this fits

The troubleshooting steps are Azure-specific, but the failure modes are not, and that matters for any team weighing platforms. Every serverless container runtime imposes the same fundamental tension between fast scale-to-zero economics and slow-loading stateful models.

Azure Container Apps sits in the middle of the market. It gives you Kubernetes-style probes and scaling without the operational weight of running AKS yourself, plus a serverless GPU tier that bills per second. The cost is the constraint surface the article documents: an 8 Gi memory ceiling on CPU plans, GPU access gated behind specific workload profiles and regions, and probe defaults tuned for web traffic rather than model loading.

AWS splits the same job across more services. App Runner is the closest analog for CPU inference but has no GPU story, so GPU workloads push you toward ECS on GPU instances, EKS, or SageMaker endpoints. SageMaker handles the model-loading and health-check problems for you, but at the price of vendor lock-in and a higher floor cost. You gain convenience and give up flexibility.

Google Cloud Run is arguably the most direct competitor, and it now supports GPU attachment with scale-to-zero, which is a genuine advantage for spiky inference traffic where idle GPU cost is the killer. Cloud Run's startup CPU boost and its own startup probe model address the same load-time problem from a different angle.

The pricing logic across all three rewards the same architecture. Idle GPU time is the dominant cost in inference hosting, so scale-to-zero is worth real money, but only if your model loads fast enough that cold starts do not wreck latency. That is precisely why the article's advice to bake weights into the image or pre-stage them on a mounted volume is more than a reliability tip. Download-at-startup turns every cold start into a network gamble and lengthens the cold-start window that scale-to-zero pricing depends on keeping short.

Migration considerations

For teams already on Container Apps for their stateless services, extending to inference is appealing because the deployment model and observability are shared. The realistic migration path is to treat model serving as a distinct profile rather than another app. That means a separate environment with a GPU workload profile, a Container Apps Job for pre-loading weights onto an Azure Files share, and probe configurations that look nothing like your API defaults.

The persistence decision is the one that travels poorly between platforms. A pipeline built on in-memory FAISS will behave acceptably in development on any cloud and then fail the same way in production everywhere, because the problem is architectural, not vendor-specific. Committing early to a managed vector store, whether Azure AI Search, a managed Cosmos DB vector index, or the equivalent on another cloud, removes a class of startup failures and keeps your retrieval layer portable.

Business impact

The practical payoff of this guidance is fewer 2 a.m. restart loops, but the strategic read is more interesting. Each of the four scenarios is really a statement about fit. If your model loads in seconds after quantization and your traffic is bursty, serverless containers with scale-to-zero are a strong economic choice and these fixes are routine hardening. If your model is large, loads slowly, and serves steady traffic, the probe gymnastics and memory tuning are warning signs that serverless may cost more in engineering time than it saves on compute, and that a dedicated inference platform or an always-on GPU deployment may be the better fit.

The Log Analytics queries the article includes, filtering ContainerAppSystemLogs_CL for exit code 137 or tracking model load time across restarts, are worth adopting regardless of platform philosophy. The metrics they expose (load time, OOM frequency, inference latency) are the same numbers that should drive the build-versus-buy decision for inference hosting. Microsoft says the next installment in the series moves from reactive fixes to proactive observability and self-healing, which is the right direction. The teams that get inference economics right are the ones measuring these failure patterns before they have to troubleshoot them.