Chapter Four: Synthesis
7. Synthesis by Analysis: Phase Vocoding
Phase vocoding is a form of synthesis by analysis in that a sound file is analyzed and is then resynthesized by the phase vocoder. The classic use of a phase vocoder is to stretch the original sound out in time without changing its original pitch. Other possible uses include compressing a sound in time without changing the original pitch or changing pitch without changing original time. Often a combination of the two (alterations of both pitch and time) is used. Unlike the classic studio technique of changing the speed of an audio tape, where time change is linked to pitch change (the slower the tape, the lower the pitch and vice versa), phase vocoding was a welcome tool to composers when it came along with the advent of digitized audio files and analysis algorithms as early as 1966 (Flanagan and Golden, Bell Labs). More practical phase vocoders were developed in the 1970's and 80's, perhaps most notably Mark Dolson's in 1983. These took advantage of the much higher computational speed available at the time.
The first part of the process involves dividing a digital input signal into numerous time segments or blocks of samples and then multiplying those blocks by an envelope called a windowing function. The shape of these windows will have an effect on the weighting of the resultant analysis. Common window shapes for phase vocoding are Hamming, Kaiser, Rectangle and von Hann (Hanning). It is well worth the effort to try each windowing function to observe the effect on your final sound. Click here to see diagrams of various windowing functions.
Each window is then subjected to a form of spectral analysis called an STFT, or short-time Fourier transform. Mentioned earlier in this text was Fourier's concept that all sound can be broken down into its constituent sine waves (their frequency, their amplitude and their phase relationship). While the mathematics of the STFT algorithm are beyond the scope of this article, it is important to note that the STFT procedure applies an FFT or fast Fourier transform (a 'trick' but highly efficient form of the discrete Fourier transform (DFT) to the windowed samples one window at a time. A DFT analysis is considered to be discrete in time (i.e. a limited block of sample values) and discrete in the frequency domain (i.e. it can be pictured as isolated spectral lines). So for each window, time and spectrum have been frozen. An STFT can be viewed as a sequence of DFTs or as a bank of bandpass filters equally spaced from 0 Hz to the Nyquist frequency.
Several analysis parameters are set before analysis begins. The first is how many frequency bands will be analyzed (these are sometimes referred to as channels). Frequency bands are equally spaced by cycles per second, begin at 0 Hz and end at the Nyquist frequency (sampling rate/2) — each equally spaced frequency is the center of its band. The number of bands is always a power of two (256, 512, 1024) — one reason the FFT is so efficient. The analysis actually creates a mirror for each positive frequency band in the negative range, with both having half the magnitude of the real frequency band measured. Among the several processes that take place after this computation, the phase vocoder throws away the negative frequencies and doubles the magnitude of the positive ones.
There is a computational relationship between the number of bands of analysis, the band frequency spacing, the sampling rate, and the number of samples in each window (frame size).
band frequency spacing = (sampling rate / 2) / number of positive frequency bands
window length in samples = sampling rate / band frequency spacing
number of bands = window length in samples / 2
window length in samples
band frequency spacing
The output of a single FFT of windowed samples is called a frame. It consists of two sets of values: the magnitudes (or amplitudes) of all the frequency bands, and the initial phases of all the frequency bands. For each band, therefore, its amplitude at the time of analysis can be quantified, and its phase, which gives clues as to where in that band's range the pitch may actually lie, is determined. The amplitude/phase pair of data for each band is called a bin. The analysis portion of phase vocoding creates a file of sequential frames, each one of which holds multiple bins.
Looking at the exponentially increasing frequency resolution above, as represented by the band frequency spacing, it seems like the best course of action for a really great analysis is to go with the largest number of bands possible. Not so fast! Look at the window lengths above as well...with each increase in frequency resolution comes a longer window of analysis. Time is stopped for each window — the FFT assumes all frequencies in a window occur at once, therefore, any change in frequency within the duration of the window will be missed. To get as much resolution as 1 Hz band spacing, the window length necessary would take 44,100 samples — at 44.1 kHz sampling rate, that's 1 second for each window! On the other hand, if one wished for 1 ms time resolution (that would be about 44 samples per window), it would be necessary to drop the frequency resolution down to a band spacing of 22 bands, or about 1000 Hz apart. Obviously, some compromise in the middle makes sense, due to this fundamental tradeoff between time (duration) and frequency. One aspect of working with phase vocoding is constant testing of different settings to see which works best for a particular sound.
Another parameter that is set before analysis begins is the hop size. This is a measurement of how far into the input file (measured in samples) the next window of analysis will begin. Another way of looking at this is referred to as the overlap factor. Overlap is a measure of the maximum # of analysis windows covering each other at a given point. If the hop size is set at half the window size, the overlap factor would be two. Some samples would be included in two successive analysis windows as illustrated below. A higher overlap factor is great for future time-dilation. However, having a high overlap factor will cause a "smearing" effect, though many composers find this desirable.
frame size = 8 hop size = 4 overlap factor = 2
The analysis portion of phase vocoding creates a file of sequential frames, each one of which holds multiple bins. PV analyses of long files with large numbers of bands and high overlap factors can be huge.
The function of phase vocoding is not just to provide analysis data, but to use that analysis data to resynthesize the sound that was analyzed using a process called an inverse FFT. Each frame of the PV analysis may be resynthesized by several methods, including the overlap-add method or the oscillator bank resynthesis method (each method doing quite a bit of manipulation to get a smooth result). But in any case, this is the stage at which modifications of the original sound can take place. To alter the speed of the original input signal (i.e. to slow it down or to speed it up), the rate at which frames are resynthesized can be sped up or slowed down from the rate of original acquisition accordingly. The PV interpolates between the frames to make that a smoother process, and this is where a greater overlap factor can help. Because the frequency content of each frame is not changed, changes in the speed do not result in changes in pitch.
Conversely, frames can be played back at the rate in which they were analyzed, but the frequencies of the entire spectrum can be shifted up or down proportionally, resulting in change in pitch, but not one of speed. A variant of this would be to multiply the spacing between the bands (rather than subtracting or adding), resulting in interesting distortions of timbres, as harmonics would then have a very different, likely inharmonic spacing.
Resynthesis can often create artifacts or anomalies in the interpolation process. Many composers enjoy using these artifacts, but many who have used phase vocoding for a long time have grown weary of hearing them become the major focus of the music. Changing analysis parameters can often avoid artifacts or make more interesting ones. Additionally, some frequencies wander between bands and can cause unnatural results. Try wider bands or use another method called MQ Analysis that tracks 'threads' within certain percentages of drifting frequency (provided by a free program called SPEAR). Pitch-tracking phase vocoding is another option. Here the phase vocoder tries to line up its bands as harmonics of a perceived fundamental frequency.
Choices, choices: Observing the Soundhack Phase Vocoder window below, you will see settings for # of bands, overlap factor, window function, time scaling (longer or shorter), pitch scaling. By selecting the Scaling Function checkbox, you may draw your own variable time-stretching function or pitch changing function. Again, it is almost impossible to predict exactly how each sound will react to a particular setting, so experimentation by changing # of bands, overlap factor, window function and of course setting the time or pitch scaling you want is essential. Soundhack is freeware for the Mac that every composer should have.
Vocabulary: synthesis by analysis, windowing function, STFT, DFT, FFT (know what they stand for), bands, frame, hop size, overlap factor, resynthesis, artifacts, know formulas for computing band frequency spacing, window length, # of bands
References: Roads, Curtis: The Computer Music Tutorial (has a great separate section on FFT math), Dodge, Charles and Jerse, Thomas: Computer Music
Audio examples coming soon.