Reverse-Mode AD Demystified: The Hidden Engine Powering Modern Machine Learning
Automatic Differentiation (AD) is the silent workhorse behind neural network training, enabling efficient gradient computation for optimization. Backpropagation is the special case used to train networks with a scalar loss; reverse-mode AD generalizes the same procedure to arbitrary computational graphs. Here's how this critical algorithm works under the hood.
The Core Principle: Unfolding the Chain Rule
At its heart, AD decomposes computations into primitive operations, then applies the multivariate chain rule systematically. Consider the sigmoid function S(x) = 1 / (1 + exp(-x)), represented as a computational graph:
from math import exp

# Decomposition into primitive operations
def sigmoid(x):
    f = -x        # df/dx = -1
    g = exp(f)    # dg/df = exp(f)
    w = 1 + g     # dw/dg = 1
    v = 1 / w     # dv/dw = -1/w**2
    return v
Reverse-mode AD calculates derivatives backward from output to inputs:
\frac{dS}{dx} = \frac{dS}{dv} \frac{dv}{dw} \frac{dw}{dg} \frac{dg}{df} \frac{df}{dx}
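A quick way to see the backward sweep concretely is to evaluate it by hand at a single point, say x = 0. The sketch below is an illustrative hand-rolled pass (not part of the original code): it multiplies the local derivatives from output to input and recovers the familiar S(0)(1 − S(0)) = 0.25.

from math import exp

# Forward sweep at x = 0
x = 0.0
f = -x          # f = 0.0
g = exp(f)      # g = 1.0
w = 1 + g       # w = 2.0
v = 1 / w       # v = 0.5  (= S(0))

# Backward sweep: multiply local derivatives from output to input
dS_dv = 1.0
dS_dw = dS_dv * (-1 / w**2)   # dv/dw = -1/w**2  -> -0.25
dS_dg = dS_dw * 1.0           # dw/dg = 1        -> -0.25
dS_df = dS_dg * exp(f)        # dg/df = exp(f)   -> -0.25
dS_dx = dS_df * (-1.0)        # df/dx = -1       ->  0.25

print(dS_dx)  # 0.25, matching S(0) * (1 - S(0))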
Handling Real-World Complexity: DAGs and Fan-Out
Most practical functions involve directed acyclic graphs (DAGs) with multi-input/multi-output nodes. Take f(x₁,x₂) = ln(x₁) + x₁x₂ − sin(x₂), whose graph reuses x₁ and x₂ across several nodes.

Reverse-mode AD handles two key patterns:
Fan-in (Multiple Inputs): Derivatives propagate separately to each input.
\frac{\partial S}{\partial x_1} = \frac{\partial S}{\partial f} \frac{\partial f}{\partial x_1}, \quad \frac{\partial S}{\partial x_2} = \frac{\partial S}{\partial f} \frac{\partial f}{\partial x_2}
Fan-out (Shared Outputs): Derivatives sum across output paths.
\frac{\partial S}{\partial x_1} = \sum_{i} \frac{\partial S}{\partial f_i} \frac{\partial f_i}{\partial x_1}
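To make the fan-out rule concrete, here is a small hand-computed sketch for the example function above. The point (x₁, x₂) = (2, 5) is chosen arbitrarily for illustration; x₁ feeds two nodes (ln(x₁) and x₁x₂), so its gradient is the sum of the contributions from both paths, and likewise for x₂.

from math import cos

x1, x2 = 2.0, 5.0  # sample point, chosen only for illustration

# f(x1, x2) = ln(x1) + x1*x2 - sin(x2)
df_dx1 = 1 / x1 + x2     # path via ln(x1) plus path via x1*x2  -> 5.5
df_dx2 = x1 - cos(x2)    # path via x1*x2 plus path via -sin(x2)

print(df_dx1, df_dx2)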
Why Reverse-Mode Dominates ML
"Reverse-mode AD is the generalization of backpropagation. When you have many inputs and few outputs—like neural networks with millions of parameters and a scalar loss—it’s dramatically more efficient than forward-mode."
Its secret weapon? Vector-Jacobian Products (VJPs). Instead of computing full Jacobian matrices, reverse-mode calculates:
Gradient = (Upstream Gradient) × (Local Jacobian)
This avoids storing enormous intermediate matrices—critical for large models.
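As a rough numerical illustration (a NumPy sketch of the idea, not framework internals): for an elementwise ReLU on an n-vector, the full local Jacobian is an n×n diagonal matrix, but the VJP needs only an elementwise multiply with the upstream gradient.

import numpy as np

n = 1_000_000
x = np.random.randn(n)
upstream = np.random.randn(n)   # gradient of the loss w.r.t. relu(x)

# The full Jacobian of relu at x would be np.diag(x > 0): an n x n matrix,
# roughly 8 TB of float64 for n = 1e6. The VJP gives the same product
# without ever building it:
grad_x = upstream * (x > 0)     # d(loss)/dx, just an n-vector

print(grad_x.shape)             # (1000000,)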
Implementing Reverse-Mode AD in 30 Lines of Python
Here’s a minimal implementation using a computational graph with operator overloading:
import math

class Var:
    def __init__(self, v):
        self.v = v              # Forward value
        self.grad = 0.0         # Accumulated gradient (d output / d self)
        self.predecessors = []  # List of (local_derivative, input_node)

    def backward(self, upstream_grad=1.0):
        # Accumulate the gradient arriving along this path, then propagate
        # it to each input, scaled by the local derivative.
        self.grad += upstream_grad
        for mult, var in self.predecessors:
            var.backward(upstream_grad * mult)

    def __add__(self, other):
        out = Var(self.v + other.v)
        # d(a + b)/da = d(a + b)/db = 1
        out.predecessors.append((1.0, self))
        out.predecessors.append((1.0, other))
        return out

    def __mul__(self, other):
        out = Var(self.v * other.v)
        # d(a * b)/da = b, d(a * b)/db = a
        out.predecessors.append((other.v, self))
        out.predecessors.append((self.v, other))
        return out

def log(x):
    out = Var(math.log(x.v))
    # d(log(x))/dx = 1/x
    out.predecessors.append((1 / x.v, x))
    return out
Try it:
x = Var(2.0)
y = Var(3.0)
f = x * y + log(x)
f.backward() # Populates x.grad and y.grad
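The gradients can then be read off and cross-checked: since f(x, y) = xy + ln(x), the analytic values at (2, 3) are ∂f/∂x = y + 1/x = 3.5 and ∂f/∂y = x = 2. A quick finite-difference check (illustrative only, reusing the point above) agrees:

import math

print(x.grad, y.grad)  # 3.5 2.0

def f_plain(a, b):
    return a * b + math.log(a)

eps = 1e-6
fd_x = (f_plain(2.0 + eps, 3.0) - f_plain(2.0 - eps, 3.0)) / (2 * eps)
fd_y = (f_plain(2.0, 3.0 + eps) - f_plain(2.0, 3.0 - eps)) / (2 * eps)
print(fd_x, fd_y)  # approximately 3.5 and 2.0, matching x.grad and y.grad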
The Engine Beneath the Framework
Industrial systems like PyTorch and JAX build on these principles but add:
- Topological sorting for single-pass efficiency (see the sketch after this list)
- Just-in-time compilation (JAX)
- Sparse Jacobian optimizations
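As a rough sketch of the first point (an assumption about the general approach, not actual PyTorch or JAX source), the recursive backward() above can be replaced by a single pass that visits nodes in reverse topological order, processing each node exactly once even when the graph has heavy fan-out. It assumes the Var class above with freshly zeroed grads.

def backward_toposort(output):
    order, visited = [], set()

    def visit(node):
        # Post-order DFS: a node is appended only after all its predecessors.
        if id(node) in visited:
            return
        visited.add(id(node))
        for _, parent in node.predecessors:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        # node.grad is fully accumulated before we reach it here.
        for mult, parent in node.predecessors:
            parent.grad += node.grad * mult

# Usage with fresh Var nodes: backward_toposort(x * y + log(x))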
Understanding reverse-mode AD isn’t academic—it’s essential for debugging gradients, implementing custom layers, and pushing ML systems further. As the author notes:
"While industrial implementations have better ergonomics, the core remains reverse-mode AD on computational graphs. To truly master deep learning frameworks, start here."
Source: Adapted from Reverse-Mode Automatic Differentiation Explained by Eli Bendersky.