
Why Deep Learning Performance Often Stalls and How to Fix It

Startups Reporter

A practical guide to diagnosing whether your PyTorch workload is limited by compute, memory bandwidth, or framework overhead, and concrete steps—operator fusion, tracing, and Tensor Core usage—to move the bottleneck and keep GPUs busy.

The three things that consume GPU time

When a model runs on an A100 you can think of the total runtime as the sum of three distinct costs:

  1. Compute – the time the GPU spends executing floating‑point operations (FLOPs).
  2. Memory bandwidth – the time spent moving tensors between DRAM, shared memory and registers.
  3. Overhead – everything else: Python dispatch, PyTorch’s dispatcher, kernel launch latency, etc.

If you can tell which of these dominates, you can pick the right class of optimisations instead of trying a laundry list of tricks that may or may not help.


1. Compute‑bound regime

You bought a GPU that can deliver 312 TFLOP/s on its Tensor Cores, but you only see a few TFLOP/s in practice. Usually that means the compute units are waiting for data or for the framework to schedule the next kernel, so the workload is not compute‑bound yet. To reach the compute‑bound regime, and to make the most of it once you are there, you should:

  • Use Tensor Cores – make sure matrix multiplications are performed in FP16 or BF16 so they can run on the Tensor Cores' matrix multiply‑accumulate units.
  • Increase batch size or model width – larger matrix shapes raise the compute intensity (FLOPs per byte moved) and push the workload toward the peak FLOP rate.
  • Avoid unnecessary casts – each standalone dtype conversion is its own kernel that reads and rewrites the whole tensor.

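If achieved FLOP/s sits close to the device peak while memory bandwidth usage is well below its peak, you are likely compute‑bound. A "GPU‑Util" column in nvidia‑smi hovering near 100 % is consistent with this but not sufficient on its own: it only reports the fraction of time during which at least one kernel was running, not how busy the arithmetic units were.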
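
To make the batch‑size point concrete, here is a back‑of‑the‑envelope sketch of compute intensity for a matrix multiplication. It assumes an idealised kernel that reads each input once and writes the output once (real kernels move more data), and the helper name matmul_intensity is illustrative, not a PyTorch API.

```python
def matmul_intensity(m: int, k: int, n: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for an (m x k) @ (k x n) matmul, assuming ideal data reuse."""
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # read A, read B, write C
    return flops / bytes_moved


# A 4096-wide linear layer in FP16/BF16 (2 bytes per element) at several batch sizes.
for batch in (1, 16, 256, 4096):
    print(batch, round(matmul_intensity(batch, 4096, 4096), 1))
```

Under these assumptions the intensity is below 1 FLOP per byte at batch size 1 and roughly 1365 FLOPs per byte at batch size 4096, which is why tiny batches cannot approach the peak FLOP rate no matter which data type is used.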


2. Memory‑bandwidth bound regime

Even if you are using Tensor Cores, a lot of time can be spent shuffling data. A single unary operation such as x = x * 2 reads the whole tensor from global memory, does one multiplication per element, and writes the result back. On an A100 the global memory bandwidth is ~1.5 TB/s, which is roughly 400 billion FP32 values (4 bytes each) per second, while the FP32 units can perform about 19.5 trillion operations per second. That is about 50 operations per value loaded, and because the result also has to be written back, any pointwise kernel that does fewer than ~100 arithmetic ops per element will be limited by memory traffic.

How to reduce bandwidth pressure

  • Operator fusion – combine consecutive pointwise ops into a single kernel so the intermediate results never touch DRAM. For example, x.cos().cos() can be compiled to one kernel that reads x once, applies two cosine calls, and writes the final result.
  • Use a fusion‑aware compiler – torch.compile (through its default TorchInductor backend), NVFuser and XLA fuse many patterns automatically.
  • Write custom kernels – when the automatic fusers miss a pattern, Triton (https://github.com/openai/triton) lets you hand‑craft kernels that keep data in registers or shared memory throughout the computation.
  • Batch small tensors – grouping many tiny tensors into a larger batch reduces the per‑tensor overhead of memory moves.
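
The saving from fusion can be estimated by counting bytes moved. The sketch below is a simplified model, assuming every unfused pointwise kernel reads and writes the full tensor exactly once; pointwise_chain_bytes is a hypothetical helper, not part of any library.

```python
def pointwise_chain_bytes(num_elems: int, num_ops: int, fused: bool,
                          bytes_per_elem: int = 4) -> int:
    """Global-memory bytes moved by a chain of unary pointwise ops."""
    kernels = 1 if fused else num_ops
    # Each kernel reads every element once and writes every element once.
    return kernels * 2 * num_elems * bytes_per_elem


n = 1 << 24  # 16M FP32 elements
unfused = pointwise_chain_bytes(n, num_ops=2, fused=False)  # x.cos().cos() as two kernels
fused = pointwise_chain_bytes(n, num_ops=2, fused=True)     # the same chain as one kernel
print(f"unfused: {unfused / 1e6:.0f} MB, fused: {fused / 1e6:.0f} MB")
```

In this model the fused traffic does not grow with the length of the chain, so the longer the run of pointwise ops, the larger the relative saving.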

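When you plot runtime and achieved FLOP/s against compute intensity (for example, the repeat count of a loop applied to each element), runtime stays flat at low intensity because the kernel is bandwidth bound, so achieved FLOP/s rises roughly linearly until it levels off at the hardware’s FLOP ceiling.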
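
The shape of that curve can be reproduced with a simple roofline model. This is a sketch under stated assumptions: the 19.5 TFLOP/s figure is the A100's FP32 (non‑Tensor‑Core) peak, which does not appear in the text above, memory transfers and arithmetic are assumed to overlap perfectly, and roofline is an illustrative name.

```python
PEAK_FLOPS = 19.5e12     # assumed A100 FP32 (non-Tensor-Core) peak, FLOP/s
PEAK_BANDWIDTH = 1.5e12  # approximate A100 global memory bandwidth, bytes/s


def roofline(repeat: int, num_elems: int = 1 << 24, bytes_per_elem: int = 4):
    """Idealised (runtime_s, achieved_flops) for `repeat` ops per element."""
    mem_time = 2 * num_elems * bytes_per_elem / PEAK_BANDWIDTH  # one read + one write
    compute_time = repeat * num_elems / PEAK_FLOPS
    runtime = max(mem_time, compute_time)  # assumes perfect overlap of the two
    return runtime, repeat * num_elems / runtime


for repeat in (1, 8, 64, 128, 512):
    runtime, flops = roofline(repeat)
    print(f"repeat={repeat:4d}  runtime={runtime * 1e6:7.1f} us  {flops / 1e12:5.2f} TFLOP/s")
```

With these constants the crossover sits at about 104 operations per element: below it the runtime does not change with the repeat count, above it the achieved rate is pinned at the FLOP ceiling.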


3. Overhead‑bound regime

Modern GPUs are orders of magnitude faster than the Python interpreter. If each kernel processes only a few thousand elements, the time spent in Python dispatch, autograd bookkeeping, and kernel launch can dominate the wall‑clock time. Typical signs:

  • GPU‑Util stays low because the GPU keeps running out of queued kernels and idles while the CPU prepares the next launch.
  • Increasing batch size barely changes runtime – the per‑op CPU cost is fixed, so the extra work fits into GPU time that was previously idle.
  • Profiler shows long idle gaps between GPU kernels in a PyTorch profiler trace while the CPU thread is continuously busy.

Ways to shrink overhead

  • Trace the model – torch.compile, torch.fx or torch.jit.trace capture a graph of the computation, which removes much of the per‑step Python dispatch.
  • Use CUDA Graphs – pre‑record a sequence of kernel launches and replay them with a single driver call.
  • Replace Python loops with vectorised ops – every loop iteration that calls a kernel incurs launch latency.
  • Consider a lower‑level runtime – for workloads that truly need sub‑microsecond latency, a C++‑only implementation may be justified.
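
A toy model shows why these measures help. It assumes the CPU launches kernels asynchronously, so that a step takes as long as the slower of the two sides; the 10 µs per‑launch cost and the 1 µs figure for replayed launches are illustrative assumptions, not measurements, and step_time_us is a hypothetical helper.

```python
def step_time_us(num_kernels: int, gpu_us_per_kernel: float,
                 cpu_us_per_launch: float = 10.0) -> float:
    """Idealised step time when CPU launches overlap with GPU execution."""
    cpu_time = num_kernels * cpu_us_per_launch  # Python + dispatcher + launch cost
    gpu_time = num_kernels * gpu_us_per_kernel  # time the kernels actually run
    return max(cpu_time, gpu_time)              # the slower side sets the pace


small = step_time_us(1000, gpu_us_per_kernel=2.0)    # tiny kernels: CPU is the limit
doubled = step_time_us(1000, gpu_us_per_kernel=4.0)  # twice the work per kernel
replayed = step_time_us(1000, gpu_us_per_kernel=4.0, cpu_us_per_launch=1.0)
print(small, doubled, replayed)
```

With these assumed numbers, doubling the work per kernel leaves the step time unchanged, which matches the batch‑size symptom listed earlier, while cutting the per‑launch cost (as tracing or CUDA Graphs aim to do) moves the limit back to the GPU.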

Putting it together: a quick diagnostic checklist

| Symptom | Likely bottleneck | First action |
|---|---|---|
| Achieved FLOP/s far below peak, memory bandwidth near peak | Memory‑bandwidth bound | Try operator fusion (NVFuser, XLA, or Triton) |
| GPU‑Util near 100 % but FLOPs are only 10–20 % of peak | Compute‑bound | Increase batch size, enable Tensor Cores, verify data types |
| GPU‑Util low, CPU usage high, runtime unchanged by larger batches | Overhead‑bound | Trace the model (torch.compile), use CUDA Graphs |

Why the three‑regime view matters

Understanding which regime you sit in prevents wasted effort. Enabling Tensor Cores for a memory‑bound model does little, just as fusing pointwise operators in a compute‑bound model yields negligible gains. By measuring FLOPs, memory traffic, and CPU‑GPU overlap you can steer your optimisation budget to the part of the pipeline that actually limits throughput.


Takeaway

Performance tuning is not a random walk through a list of tricks. Identify whether your workload is compute‑, bandwidth‑, or overhead‑bound, then apply the corresponding class of optimisation. When you keep the GPU busy in the right way, the “brrrr” sound you hear from the fans is a sign that the hardware is finally doing what you paid for.
