Fast Convolution Algorithms
Overlap-add, Overlap-save
Introduction
y(n) = Σ_{k=0}^{L−1} h(k) x(n − k)                                (1)
Zeros are appended to the shorter of the signal or filter sequence until they are both the same length. If the FFT of the
signal x(n) is term-by-term multiplied by the FFT of the filter h(n), the
result is the FFT of the output y(n). However, the length of y(n) obtained
by an inverse FFT is the same as the length of the input. Because the DFT
or FFT is a periodic transform, the convolution implemented using this FFT
approach is cyclic convolution which means the output of (1) is wrapped or
aliased. The tail of y(n) is added to its head, but that is usually not what
is wanted for filtering or normal convolution and correlation. This aliasing,
the effects of cyclic convolution, can be overcome by appending zeros to
both x(n) and h(n) until their lengths are N + L − 1, and by then using
the FFT. The part of the output that is aliased is zero and the result of the
cyclic convolution is exactly the same as non-cyclic convolution. The cost is
taking the FFT of lengthened sequences, sequences for which about half
the numbers are zero. Now that we can do non-cyclic convolution with the
FFT, how do we account for the effects of sectioning the input and output
into blocks?
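As a concrete sketch of this zero-padding idea, assuming numpy (the random test signal is an illustrative choice; the lengths N = 10 and L = 5 and the constant filter h(n) = 0.2 match the examples in the figures below):

```python
import numpy as np

rng = np.random.default_rng(0)

N, L = 10, 5                      # signal and filter lengths
x = rng.standard_normal(N)
h = np.full(L, 0.2)               # constant FIR filter, as in the figures

# Zero-pad both sequences to length N + L - 1 so the cyclic convolution
# computed via the FFT has no aliased (wrapped) samples.
M = N + L - 1
X = np.fft.fft(x, M)              # the second argument zero-pads to length M
H = np.fft.fft(h, M)
y_fft = np.fft.ifft(X * H).real   # length N + L - 1

y_direct = np.convolve(x, h)      # direct non-cyclic convolution, same length
assert np.allclose(y_fft, y_direct)
```

Without the padding (FFT length N instead of N + L − 1), the last L − 1 output samples would wrap around and add to the head, which is the aliasing described above.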
2.1 Overlap-Add
Because convolution is linear, the output of a long sequence can be calculated by simply summing the outputs of each block of the input. What is
complicated is that the output blocks are longer than the input blocks. This is
dealt with by overlapping the tail of the output from the previous block
with the beginning of the output from the present block. In other words, if
the block length is N and it is greater than the filter length L, the output
from the second block will overlap the tail of the output from the first block
and they will simply be added. Hence the name: overlap-add. Figure 1
illustrates why the overlap-add method works, for N = 10, L = 5.
Combining the overlap-add organization with use of the FFT yields a
very efficient algorithm for calculating convolution that is faster than direct
calculation for lengths above 20 to 50. This cross-over point depends on the
computer being used and the overhead incurred by the FFTs.
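A minimal sketch of FFT-based overlap-add under these assumptions (the function name and test signal are illustrative; the filter FFT is computed once, as discussed in Section 2.3):

```python
import numpy as np

def overlap_add(x, h, block_len):
    """Convolve x with h by summing the convolutions of the input blocks.

    Each block's output is block_len + len(h) - 1 samples long, so its
    tail overlaps the head of the next block's output and is simply added.
    """
    L = len(h)
    M = block_len + L - 1                  # FFT length: no aliasing
    H = np.fft.fft(h, M)                   # filter FFT done once and stored
    y = np.zeros(len(x) + L - 1)
    for start in range(0, len(x), block_len):
        xb = x[start:start + block_len]    # current input block
        yb = np.fft.ifft(np.fft.fft(xb, M) * H).real
        n = min(len(yb), len(y) - start)
        y[start:start + n] += yb[:n]       # overlap and add the tail
    return y

# Check against direct convolution with the text's numbers: block 10, L = 5
rng = np.random.default_rng(1)
x = rng.standard_normal(40)
h = np.full(5, 0.2)
assert np.allclose(overlap_add(x, h, 10), np.convolve(x, h))
```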
2.2 Overlap-Save
Figure 1: Overlap-Add Algorithm. The sequence y(n) is the result of convolving x(n) with an FIR filter h(n) of length 5. In this example, h(n) = 0.2 for n = 0, . . . , 4. The block length is 10, the overlap is 4. As illustrated in the figure, x(n) = x1(n) + x2(n) + · · · and y(n) = y1(n) + y2(n) + · · ·, where yi(n) is the result of convolving xi(n) with the filter h(n).
convolution), the head and the tail overlap, so the FFT length is 14. (In
practice, block lengths are generally chosen so that the FFT length N + L − 1
is a power of 2).
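A corresponding sketch of overlap-save, again assuming numpy: each FFT segment overlaps the previous one by L − 1 input samples, and the first L − 1 outputs of each cyclic convolution are discarded as aliased (the helper name and test signal are illustrative):

```python
import numpy as np

def overlap_save(x, h, block_len):
    """Convolve x with h, keeping only the valid part of each cyclic block.

    Each FFT segment holds block_len + len(h) - 1 input samples; the first
    len(h) - 1 outputs of each cyclic convolution are aliased and discarded,
    leaving block_len valid output samples per segment.
    """
    L = len(h)
    M = block_len + L - 1                  # FFT length (14 in the text's example)
    H = np.fft.fft(h, M)                   # filter FFT done once and stored
    # Prepend L - 1 zeros so the first segment's discarded samples are harmless,
    # and append zeros so the final segment is full length.
    xp = np.concatenate([np.zeros(L - 1), x, np.zeros(block_len)])
    y = np.zeros(len(x) + L - 1)
    for start in range(0, len(y), block_len):
        seg = xp[start:start + M]          # segments overlap by L - 1 samples
        yb = np.fft.ifft(np.fft.fft(seg, M) * H).real
        yk = yb[L - 1:]                    # drop the aliased head, save the rest
        n = min(block_len, len(y) - start)
        y[start:start + n] = yk[:n]
    return y

# Check against direct convolution with the text's numbers: block 10, L = 5
rng = np.random.default_rng(2)
x = rng.standard_normal(40)
h = np.full(5, 0.2)
assert np.allclose(overlap_save(x, h, 10), np.convolve(x, h))
```

The contrast with overlap-add: here the *input* segments overlap and invalid output samples are discarded, whereas in overlap-add the input blocks are disjoint and the overlapping *output* tails are summed.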
Figure 2: Overlap-Save Algorithm. The sequence y(n) is the result of convolving x(n) with an FIR filter h(n) of length 5. In this example, h(n) = 0.2 for n = 0, . . . , 4. The block length is 10, the overlap is 4. As illustrated in the figure, the sequence y(n) is obtained, block by block, from the appropriate block of yi(n), where yi(n) is the result of convolving xi(n) with the filter h(n).
2.3
Because the efficiency of the FFT is O(N log N), the efficiency of the overlap methods for convolution increases with length. Using the FFT for convolution requires one length-N forward FFT, N complex multiplications, and one length-N inverse FFT. The FFT of the filter is done once and
stored rather than done repeatedly for each block. For short lengths, direct convolution will be more efficient. The exact filter length at which the efficiency cross-over occurs depends on the computer and software being used.
If it is determined that the FFT is potentially faster than direct convolution, the next question is what block length to use. Here, there is
a compromise between the improved efficiency of long FFTs and the fact that you are processing many appended zeros that contribute nothing to the output. An empirical plot of multiplications (and, perhaps, additions) per
output point vs. block length will have a minimum that may be several
times the filter length. This is an important parameter that should be optimized for each implementation. Remember that this increased block length
may improve efficiency but it adds a delay and requires memory for storage.
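Such a plot can be sketched with a simple operation-count model. The radix-2 FFT cost estimate (M/2) log2 M complex multiplications, the filter length, and the candidate block lengths below are assumptions for illustration, not measurements:

```python
import math

def mults_per_output(block_len, L):
    """Rough complex-multiplication count per output sample for FFT
    overlap-add: one forward FFT, M pointwise products, and one inverse
    FFT per block of block_len output samples (filter FFT precomputed).
    Uses the common radix-2 cost model of (M/2) log2(M) per FFT."""
    M = block_len + L - 1                  # FFT length for one block
    fft_cost = (M / 2) * math.log2(M)
    return (2 * fft_cost + M) / block_len

L = 64                                     # assumed filter length
counts = {B: mults_per_output(B, L)
          for B in (64, 128, 256, 512, 1024, 2048, 4096)}
best = min(counts, key=counts.get)
print(best, round(counts[best], 1))
```

On this model the minimum falls at a block length several times the filter length, consistent with the remark above; for a real implementation the curve should be measured, since memory behavior and FFT library overhead shift the minimum.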