An in‑depth walkthrough of the Cooley‑Tukey FFT, showing how the discrete Fourier transform can be rewritten as a set of smaller DFTs, why this reduces the complexity from O(N²) to O(N log N), and what the practical consequences are for real‑world signal processing.
Fast Fourier Transforms Part 1: The Cooley‑Tukey Algorithm
Connor Boyle – 11 September 2025
1. The problem the FFT solves
The discrete Fourier transform (DFT) turns a finite sequence \(x[0],\dots,x[N-1]\) of complex numbers into a new sequence \(X[k]\) that describes the signal’s frequency content:
\[ X[k] = \sum_{j=0}^{N-1} x[j]\,e^{-2\pi i\,jk/N} \]
If we compute this formula directly for each of the \(N\) output bins, we perform \(N\) multiplications for each of \(N\) sums – a total of \(O(N^{2})\) operations. For modest‑size signals this is acceptable, but for audio, image, or scientific data sets the cost quickly becomes prohibitive.
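To see where the quadratic cost comes from, here is a minimal direct implementation of the definition (a sketch in Python; the function name `naive_dft` is mine, not a library routine). The nested loops perform exactly \(N\) multiply‑adds for each of the \(N\) output bins.

```python
import cmath

def naive_dft(x):
    """Direct O(N^2) evaluation of the DFT definition."""
    N = len(x)
    X = []
    for k in range(N):                 # one output bin per iteration
        s = 0j
        for j in range(N):             # N multiply-adds per bin
            s += x[j] * cmath.exp(-2j * cmath.pi * j * k / N)
        X.append(s)
    return X
```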
The key insight of the Cooley‑Tukey algorithm (the original "FFT") is that when \(N\) is composite we can factor the sum into a product of smaller DFTs. By applying the factorisation recursively we shrink the work to \(O(N\log N)\).
2. Re‑expressing the DFT with a radix‑\(r\) decomposition
Assume \(N = r\,d\) with \(r,d \in \mathbb{N}\) and \(r>1\). Write the index \(j\) of the input as
\[ j = j_{1}r + j_{0},\qquad 0\le j_{0}<r,\; 0\le j_{1}<d \]
and the output index \(k\) as
\[ k = k_{1}d + k_{0},\qquad 0\le k_{0}<d,\; 0\le k_{1}<r \]
Substituting these expressions into the DFT and using the shorthand \(W_{N}=e^{-2\pi i/N}\) gives
\[
\begin{aligned}
X[k] &= \sum_{j_{0}=0}^{r-1}\sum_{j_{1}=0}^{d-1} x\bigl(j_{1}r+j_{0}\bigr)\, W_{N}^{(j_{1}r+j_{0})(k_{1}d+k_{0})} \\
     &= \sum_{j_{0}=0}^{r-1} \Biggl[\, W_{N}^{j_{0}k_{0}}\, \underbrace{\sum_{j_{1}=0}^{d-1} x\bigl(j_{1}r+j_{0}\bigr)\, W_{d}^{j_{1}k_{0}}}_{\displaystyle =\, F\{x_{r}^{\,j_{0}}\}[k_{0}]} \Biggr]\, W_{N}^{j_{0}k_{1}d}
\end{aligned}
\]
The inner sum is exactly a DFT of length \(d\) applied to the subsequence
\[ x_{r}^{\,j_{0}}[j_{1}] = x\bigl(j_{1}r + j_{0}\bigr),\qquad 0\le j_{1}<d \]
Thus the original transform can be written as a sum of \(r\) length‑\(d\) DFTs, each multiplied by a twiddle factor \(W_{N}^{j_{0}k_{0}}\) and then combined by the outer factor \(W_{N}^{j_{0}k_{1}d} = W_{r}^{j_{0}k_{1}}\), which is itself an \(r\)-point DFT across the subsequence results. The algorithm proceeds in three conceptual stages (a code sketch follows the list):
- Split the input into \(r\) interleaved subsequences (sometimes called radix‑\(r\) streams).
- Recursively transform each subsequence with a DFT of size \(d\).
- Combine the results with the twiddle factors and a short \(r\)-point DFT for each output frequency.
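The three stages map almost line for line onto a recursive implementation. The sketch below fixes \(r=2\) (so \(d=N/2\)) for readability and falls back to the `naive_dft` sketch from section 1 when the length is odd; `fft_ct` is an illustrative name of mine, not a library routine.

```python
import cmath

def fft_ct(x):
    """Recursive radix-2 Cooley-Tukey FFT; falls back to the
    direct DFT when the length has no factor of 2."""
    N = len(x)
    if N == 1:
        return list(x)
    if N % 2 != 0:
        return naive_dft(x)          # odd length: no radix-2 split
    # Stage 1: split into r = 2 interleaved subsequences and
    # Stage 2: transform each one (length d = N/2).
    even = fft_ct(x[0::2])           # subsequence j0 = 0
    odd  = fft_ct(x[1::2])           # subsequence j0 = 1
    # Stage 3: twiddle and combine with 2-point DFTs (butterflies).
    X = [0j] * N
    for k0 in range(N // 2):
        t = cmath.exp(-2j * cmath.pi * k0 / N) * odd[k0]   # W_N^{k0} twiddle
        X[k0] = even[k0] + t                 # output with k1 = 0
        X[k0 + N // 2] = even[k0] - t        # output with k1 = 1
    return X
```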
3. Complexity analysis
If we denote the cost of an \(n\)-point DFT by \(T(n)\), the decomposition yields the recurrence
\[ T(N) = r\,T(d) + O(N\,r) \]
where the linear term accounts for the twiddle‑factor multiplications and the final \(r\)-point DFTs. When \(N\) is a power of two we can choose \(r=2\) at every level, giving
\[ T(2^{n}) = 2\,T(2^{n-1}) + O(2^{n}) \]
which solves to \(T(2^{n}) = O(2^{n}n) = O(N\log N)\). The same analysis works for any mixed‑radix factorisation; the dominant term is proportional to \(N\) times the sum of the radices used, which is \(O(N\log N)\) whenever the prime factors of \(N\) are bounded.
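Unrolling the radix‑2 recurrence shows where the \(\log\) factor comes from: each of the \(n=\log_{2}N\) levels contributes the same amount of work. Writing \(c\) for the constant hidden in the \(O(2^{n})\) term,
\[
\begin{aligned}
T(2^{n}) &= 2\,T(2^{n-1}) + c\,2^{n} \\
         &= 4\,T(2^{n-2}) + 2\,c\,2^{n} \\
         &\;\;\vdots \\
         &= 2^{n}\,T(1) + n\,c\,2^{n} \;=\; O(2^{n}n) \;=\; O(N\log N).
\end{aligned}
\]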
If \(N\) is prime there is no non‑trivial factorisation, so the Cooley‑Tukey recursion cannot be applied. In that case one resorts to algorithms such as Bluestein's chirp‑Z transform, which embeds the prime‑length DFT into a larger power‑of‑two convolution that does admit a Cooley‑Tukey decomposition.
4. From forward to inverse transform
The inverse DFT differs only by a sign change in the exponent and a scaling factor \(1/N\). Because the decomposition depends only on the index arithmetic, not on the sign of the exponent, the same recursion works unchanged; the only modifications are:
- use \(e^{+2\pi i/N}\) in place of \(e^{-2\pi i/N}\) as the root of unity \(W_{N}\), and
- divide every output by \(N\) after the final combination step.
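A common way to get both modifications without a second code path is the conjugation identity \(\operatorname{IDFT}\{X\} = \overline{\operatorname{DFT}\{\overline{X}\}}/N\). A minimal sketch, reusing the `fft_ct` function from the previous section (the name `ifft_ct` is mine):

```python
def ifft_ct(X):
    """Inverse DFT via the forward FFT: conjugating input and output
    flips the sign of the exponent; the 1/N scaling is applied last."""
    N = len(X)
    x = fft_ct([v.conjugate() for v in X])
    return [v.conjugate() / N for v in x]
```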
The original Cooley‑Tukey paper actually presented the inverse algorithm first, noting that the forward transform follows by a simple conjugation.
5. Practical implications
- Speed – For signals whose length is a power of two, a well‑implemented radix‑2 FFT is faster than the naïve DFT by orders of magnitude: at \(N = 4096\), roughly \(N^{2} \approx 1.7\times10^{7}\) operations shrink to about \(N\log_{2}N \approx 4.9\times10^{4}\), so transforms that would take milliseconds complete in microseconds.
- Numerical accuracy – Fewer arithmetic operations mean less accumulated round‑off error. In practice the error growth is proportional to \(\log N\) rather than \(N\).
- Memory access patterns – The recursive splitting induces a bit‑reversal permutation of the input. Efficient implementations (e.g., the classic FFTW library) reorder data in‑place to minimise cache misses; a sketch of the permutation follows this list.
- Algorithmic flexibility – By choosing different factorisations (radix‑4, mixed‑radix, split‑radix) one can trade off the number of multiplications against the number of additions, adapting to the instruction set of a given processor.
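To make the reordering concrete, here is a standalone sketch of the bit‑reversal permutation for power‑of‑two lengths (the function name `bit_reverse_permute` is mine; iterative in‑place FFTs typically apply this permutation before the butterfly passes):

```python
def bit_reverse_permute(x):
    """Return x reordered so that index j moves to the position
    obtained by reversing the bits of j (len(x) must be a power of two)."""
    N = len(x)
    bits = N.bit_length() - 1                        # log2(N)
    out = [None] * N
    for j in range(N):
        rev = int(format(j, f'0{bits}b')[::-1], 2)   # reverse the bit string
        out[rev] = x[j]
    return out
```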
6. Counter‑perspectives and common misconceptions
6.1 “FFT is a different transform”
A frequent source of confusion is the habit of saying "the FFT of a signal" as if the FFT were a mathematical operation distinct from the DFT. In reality the FFT is merely a family of algorithms that compute the exact DFT (or its inverse) more efficiently. The result is identical up to floating‑point round‑off, regardless of whether you use a radix‑2, radix‑4, or Bluestein implementation.
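This is easy to check numerically with the sketches from earlier sections: the direct evaluation and the recursive FFT agree to round‑off.

```python
import random

# Random complex test signal of power-of-two length.
signal = [complex(random.random(), random.random()) for _ in range(64)]

direct = naive_dft(signal)   # O(N^2) definition from section 1
fast = fft_ct(signal)        # O(N log N) recursion from section 2

max_err = max(abs(a - b) for a, b in zip(direct, fast))
print(f"max |DFT - FFT| = {max_err:.2e}")   # tiny: round-off only
```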
6.2 Prime‑length inputs
When \(N\) contains large prime factors the speed‑up of Cooley‑Tukey diminishes because the recursion cannot split the problem evenly. In such cases hybrid algorithms (e.g., Rader’s algorithm for prime‑length DFTs) are combined with Cooley‑Tukey to retain near‑optimal performance.
7. Interactive illustration (optional)
A small web demo can help visualise the three stages described above. The demo shows:
- an array of real‑valued samples (you can drag to change values),
- the complex rotations \(W_{N}^{\cdot}\) represented as clock‑hand graphics, and
- the accumulation of each output bin when you hover over it.
The source code for the visualization is available on GitHub and can be embedded in any modern browser.
8. Looking ahead
The Cooley‑Tukey decomposition is the backbone of virtually every modern FFT library, but it is not the whole story. In the next installment I will explore Bluestein’s algorithm, which enables an \(O(N\log N)\) FFT for arbitrary lengths by converting the problem into a convolution that does admit a radix‑2 decomposition. Understanding Bluestein will also clarify why many high‑performance libraries fall back to a mixed‑radix strategy that blends Cooley‑Tukey, Rader, and prime‑factor algorithms.
9. References & further reading
- J. W. Cooley and J. W. Tukey, "An Algorithm for the Machine Calculation of Complex Fourier Series," Mathematics of Computation, 1965.
- M. Frigo and S. G. Johnson, "The Design and Implementation of FFTW3," Proceedings of the IEEE, 2005. (fftw.org)
- L. I. Bluestein, "A Linear Filtering Approach to the Computation of Discrete Fourier Transform," IEEE Transactions on Audio and Electroacoustics, 1970.
Thanks to Andre Archer for proofreading the manuscript. Any remaining errors are my own.