Automatic Differentiation (AD) is the silent workhorse behind neural network training, enabling efficient gradient computation for optimization. Backpropagation is the special case tailored to neural networks with a scalar loss; reverse-mode AD generalizes the same process to arbitrary computational graphs. Here's how this critical algorithm works under the hood.

The Core Principle: Unfolding the Chain Rule

At its heart, AD decomposes computations into primitive operations, then applies the multivariate chain rule systematically. Consider the sigmoid function S(x) = 1 / (1 + exp(-x)), represented as a computational graph:

# Primitives decomposition
from math import exp

def sigmoid(x):
    f = -x          # df/dx = -1
    g = exp(f)      # dg/df = exp(f)
    w = 1 + g       # dw/dg = 1
    v = 1 / w       # dv/dw = -1/w²
    return v

Reverse-mode AD calculates derivatives backward, from the output to the inputs. Since S = v here, the chain rule unfolds as:

\frac{dS}{dx} = \frac{dS}{dv} \frac{dv}{dw} \frac{dw}{dg} \frac{dg}{df} \frac{df}{dx}
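To make this concrete, here is a minimal sketch of the backward pass for the sigmoid decomposition above, written out by hand rather than by an AD system; the numeric check against the closed-form derivative S'(x) = S(x) * (1 - S(x)) is an illustrative addition:

import math

def sigmoid_with_grad(x):
    # Forward pass: compute and keep the intermediate values
    f = -x
    g = math.exp(f)
    w = 1 + g
    v = 1 / w
    # Reverse pass: start from dS/dv = 1 and multiply local derivatives
    dS_dv = 1.0
    dS_dw = dS_dv * (-1 / w**2)   # dv/dw = -1/w^2
    dS_dg = dS_dw * 1.0           # dw/dg = 1
    dS_df = dS_dg * math.exp(f)   # dg/df = exp(f)
    dS_dx = dS_df * (-1.0)        # df/dx = -1
    return v, dS_dx

v, grad = sigmoid_with_grad(0.5)
assert abs(grad - v * (1 - v)) < 1e-12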

Handling Real-World Complexity: DAGs and Fan-Out

Most practical functions involve directed acyclic graphs (DAGs) with multi-input/multi-output nodes. For a function like f(x₁,x₂)=ln(x₁)+x₁x₂−sin(x₂):

Computational Graph: nodes for log(x1), x1*x2, and sin(x2), combined by addition and subtraction into the final output
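A minimal sketch of the same decomposition into primitives, in the style of the sigmoid example (the intermediate names a through d are illustrative choices, not from the original article):

from math import log, sin

def f(x1, x2):
    a = log(x1)     # da/dx1 = 1/x1
    b = x1 * x2     # db/dx1 = x2, db/dx2 = x1
    c = sin(x2)     # dc/dx2 = cos(x2)
    d = a + b       # dd/da = dd/db = 1
    return d - c    # dout/dd = 1, dout/dc = -1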

Reverse-mode AD handles two key patterns:

  1. Fan-in (Multiple Inputs): Derivatives propagate separately to each input.

    \frac{\partial S}{\partial x_1} = \frac{\partial S}{\partial f} \frac{\partial f}{\partial x_1}, \quad
    \frac{\partial S}{\partial x_2} = \frac{\partial S}{\partial f} \frac{\partial f}{\partial x_2}
    
  2. Fan-out (Shared Outputs): Derivatives sum across output paths (see the worked example after this list).

    \frac{\partial S}{\partial x_1} = \sum_{i} \frac{\partial S}{\partial f_i} \frac{\partial f_i}{\partial x_1}
    
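For example, in f(x₁,x₂) = ln(x₁) + x₁x₂ − sin(x₂) above, x₁ fans out into both ln(x₁) and x₁x₂, so its gradient sums the contributions from the two paths:

\frac{\partial f}{\partial x_1} = \underbrace{\frac{1}{x_1}}_{\text{via } \ln(x_1)} + \underbrace{x_2}_{\text{via } x_1 x_2}, \qquad
\frac{\partial f}{\partial x_2} = x_1 - \cos(x_2)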

Why Reverse-Mode Dominates ML

"Reverse-mode AD is the generalization of backpropagation. When you have many inputs and few outputs—like neural networks with millions of parameters and a scalar loss—it’s dramatically more efficient than forward-mode."

Its secret weapon? Vector-Jacobian Products (VJPs). Instead of computing full Jacobian matrices, reverse-mode calculates:

Gradient = (Upstream Gradient) × (Local Jacobian)

This avoids storing enormous intermediate matrices—critical for large models.
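As a rough sketch of what this buys (the linear layer y = W x and the shapes below are illustrative assumptions, not from the original article), the backward step only ever forms products of the upstream gradient with local Jacobians:

import numpy as np

# Illustrative linear layer y = W @ x
m, n = 3, 5
W = np.random.randn(m, n)
x = np.random.randn(n)

# Upstream gradient dL/dy, handed back from later nodes in the graph
upstream = np.random.randn(m)

# Vector-Jacobian products: multiply the upstream gradient into local Jacobians
dL_dx = upstream @ W            # dy/dx = W, so dL/dx has shape (n,)
dL_dW = np.outer(upstream, x)   # dL/dW has shape (m, n); the full Jacobian
                                # dy/dW, of shape (m, m*n), is never built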

Implementing Reverse-Mode AD in 30 Lines of Python

Here’s a minimal implementation using a computational graph with operator overloading:

import math

class Var:
    def __init__(self, v):
        self.v = v              # Forward value
        self.grad = 0.0         # Accumulated gradient
        self.predecessors = []  # List of (local_derivative, input_node) pairs

    def backward(self, upstream_grad=1.0):
        # Accumulate the upstream gradient, then propagate it to each input,
        # scaled by the local derivative of this node w.r.t. that input.
        self.grad += upstream_grad
        for mult, var in self.predecessors:
            var.backward(upstream_grad * mult)

    def __add__(self, other):
        out = Var(self.v + other.v)
        # d(a + b)/da = d(a + b)/db = 1
        out.predecessors.append((1.0, self))
        out.predecessors.append((1.0, other))
        return out

    def __mul__(self, other):
        # Needed so the x * y example below works
        out = Var(self.v * other.v)
        # d(a * b)/da = b, d(a * b)/db = a
        out.predecessors.append((other.v, self))
        out.predecessors.append((self.v, other))
        return out

def log(x):
    out = Var(math.log(x.v))
    # d(log(x))/dx = 1/x
    out.predecessors.append((1 / x.v, x))
    return out

Try it:

x = Var(2.0)
y = Var(3.0)
f = x * y + log(x)
f.backward()     # Populates x.grad and y.grad
print(x.grad)    # df/dx = y + 1/x = 3.5
print(y.grad)    # df/dy = x = 2.0

The Engine Beneath the Framework

Industrial systems like PyTorch and JAX build on these principles but add:

  • Topological sorting for single-pass efficiency (a minimal sketch follows this list)
  • Just-in-time compilation (JAX)
  • Sparse Jacobian optimizations
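As a minimal sketch of the first point (reusing the toy Var class above; this is not how PyTorch or JAX actually organize it), a reverse topological order lets each node be processed exactly once, instead of re-traversing shared subgraphs the way the recursive backward above does:

def backward_topological(output):
    # Collect nodes in topological order (inputs before outputs)
    order, visited = [], set()

    def visit(node):
        if id(node) in visited:
            return
        visited.add(id(node))
        for _, parent in node.predecessors:
            visit(parent)
        order.append(node)

    visit(output)

    # Single reverse sweep: each node pushes its gradient to its inputs once
    output.grad = 1.0
    for node in reversed(order):
        for mult, parent in node.predecessors:
            parent.grad += node.grad * mult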

Understanding reverse-mode AD isn’t academic—it’s essential for debugging gradients, implementing custom layers, and pushing ML systems further. As the author notes:

"While industrial implementations have better ergonomics, the core remains reverse-mode AD on computational graphs. To truly master deep learning frameworks, start here."

Source: Adapted from Reverse-Mode Automatic Differentiation Explained by Eli Bendersky.