
low latency pitch shifting

My previous page on pitch shifting illustrated the issues of a 'dumb' time domain pitch shifter using two delay lines. Echoes and modulations spoil the sound, which is why such a pitch shifter is hardly of practical use. I revisited the method many times though, because its response felt snappier than other methods like WSOLA and phase vocoder. This page illustrates how a dumb pitch shifter gets a bit smarter through pitch detection while achieving an average latency of only a few milliseconds. The result sounds much like Digitech's Whammy pedal. No formant preservation, but fast response. And pretty robust with monophonic input.

Let me briefly recap how the basic time domain pitch shifter works. Two playback read pointers move alternately at non-unity speed through a circular buffer of input samples. An automatic crossover function (window) masks the discontinuities that occur when a read pointer jumps between maximum delay and zero delay. When the two output signals are in phase, the result is optimal. But this is usually not the case. With the outputs out of phase, periodic cancellation leads to very audible amplitude modulation in periodic sound. Another adverse effect of output overlap is the audible duplication of transient sounds. The picture below shows a simple pitch shifter in Pure Data, with a sinewave input and amplitude modulation in the pitch shifted output.

amplitude modulation in a pitch shifted sinusoid

Playing around with my Pure Data patches, I found that a perfect non-modulated output could be produced by matching parameter settings to the frequency of an input sinewave. Basically, a maximum delay of two times the sinewave's period length seemed to give good results:

maximum delay in milliseconds = 2 * (1 / frequency) * 1000
= 2000 / frequency
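
As a quick check of the formula above, here is a minimal Python sketch. The function name `max_delay_ms` is only illustrative and does not come from the Pd patch.

```python
def max_delay_ms(frequency_hz):
    """Maximum delay in milliseconds: two periods of the input sinewave."""
    if frequency_hz <= 0:
        raise ValueError("frequency must be positive")
    return 2000.0 / frequency_hz


# A few input frequencies and the delay setting they would call for.
for freq in (110.0, 440.0, 1000.0):
    print(freq, "Hz ->", round(max_delay_ms(freq), 3), "ms")
```

Low notes need long delays: at 110 Hz the two periods already span more than 18 milliseconds.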

This triggered the idea to use the pitch tracker [helmholtz~] and synchronize the delay line parameters to the input pitch. A snapshot of the Pd patch below illustrates how the output sinusoid is no longer amplitude modulated:

The experiment was encouraging, but it was not close to a practical pitch shifter implementation. Every time the input frequency changed, not only would the maximum delay time change abruptly, but the actual delay time (being a phase between zero and maximum delay time) would jump too. This of course caused discontinuities in the pitch shifted output. The idea of pitch synchronization worked, but it had to be implemented in a different way.

Admittedly, I searched the internet for a solution first. The wildest proposals have been published on the topic of low latency pitch shifting. What put me on the right track was this article by Azadeh Haghparast et al. It describes how a single delay line is read with variable speed, while splice points for the read pointer are found by cross correlating the candidate overlap areas. The article is illuminating because it gives comprehensive definitions and illustrations of the problem at hand. In the end, I implemented a similar approach, but based on pitch detection instead of cross correlation.

read pointer jump

The basic idea of a delay line pitch shifter is to write input data at a constant speed, while copying that data to the output at a different constant speed. Since the audio card input and output run at the same sample rate, the speed difference is simulated by reading 'in between the samples', or by skipping samples. Obviously, it doesn't take long before the distance between input write pointer and output copy pointer becomes problematic. When copy pointer speed is slow (for lowered pitch), it will soon lag too far behind to reflect real time input events. On the other hand, when copy pointer speed is fast (for higher pitch), it will quickly pass the input write pointer and read stale data from the previous iteration through the circular buffer. Therefore the copy pointer has to jump backward or forward sometimes.
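
Reading 'in between the samples' can be sketched as follows. This Python fragment assumes linear interpolation, which is only one of several possible interpolation methods and is not necessarily what the Pd objects use. The names `read_interpolated` and `copy_at_speed` are illustrative.

```python
import math


def read_interpolated(buffer, position):
    """Read a circular buffer at a fractional position (linear interpolation)."""
    size = len(buffer)
    base = math.floor(position)
    frac = position - base
    i0 = base % size          # sample just before the read position
    i1 = (base + 1) % size    # sample just after it, wrapping around
    return buffer[i0] * (1.0 - frac) + buffer[i1] * frac


def copy_at_speed(buffer, speed, count, start=0.0):
    """Copy `count` output samples while the read pointer advances by `speed`."""
    out = []
    position = start
    for _ in range(count):
        out.append(read_interpolated(buffer, position))
        position += speed
    return out


ramp = [float(n) for n in range(8)]
print(copy_at_speed(ramp, 0.5, 4))  # slow pointer: reads between the samples
print(copy_at_speed(ramp, 2.0, 4))  # fast pointer: skips every other sample
```

The sketch has no write pointer and no pointer jumps, so it only shows why the two pointers drift apart.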

The picture below shows the case of a copy pointer slowed down to speed 0.5, and it has to jump forward now and then. The read pointer forward jump is equivalent to cutting away a segment of the signal.

reading in between samples, jumping forward

The pointer jump represents a discontinuity which must be smoothed by some cross fade functionality. The fade out could start at the jump-from point while the fade in ends at the jump-to point. The pointer jump is thus bridged by a cross fade region where the overlap of fade out and fade in curve should sum to 1. The picture below is again the case of a slowed down reading speed, but now with overlapping fade out and fade in samples.

red lines: fade out samples, green lines: fade in samples

If reading speed and pointer jump distance were always the same, fade out samples could be written to the delay line in advance and mixed with the fade in samples later. No DSP law prohibits writing samples to 'future positions' in a delay line, although it is not commonly done. I mention this only as a conceptual illustration, an alternative way of looking at the procedure, not as a practical approach. In practice, the system is more flexible when samples are written to the delay line in their natural order and copied to their overlap position 'on demand', according to a reading speed and pointer jump length which may vary over time.

red lines: fade out samples, green lines: fade in samples

When copying at increased speed, the copy pointer must jump backward now and then, like in the proverbial Echternach Procession. This is equivalent to creating duplicates of signal segments. In the illustration below, every other sample is skipped but the remaining ones are all copied twice.

skipping samples, jumping backward
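
The duplication caused by backward jumps can be reproduced with a toy example. This Python sketch assumes an integer speed, so that all indices stay whole, and uses hard cuts without any crossfade. The function name and the `segment` parameter are illustrative only.

```python
def copy_with_backward_jumps(signal, speed, segment, count):
    """Read `signal` at integer `speed`, jumping back after every `segment` outputs.

    The jump distance is chosen so that the net advance per segment equals
    `segment` input samples, which keeps the read pointer at a bounded
    distance from a write pointer moving at unity speed.
    """
    out = []
    position = 0
    for n in range(count):
        out.append(signal[position])
        position += speed
        if (n + 1) % segment == 0:
            position -= (speed - 1) * segment  # backward jump
    return out


samples = list(range(16))
print(copy_with_backward_jumps(samples, speed=2, segment=4, count=12))
# prints [0, 2, 4, 6, 4, 6, 8, 10, 8, 10, 12, 14]
```

After the first segment, every retained sample shows up twice in the output, which is the duplication of signal segments mentioned above.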

And here is the version with fade out and fade in samples drawn:

red lines: fade out samples, green lines: fade in samples

phase coherence

For periodic sound, phase-coherent overlap is important because repetitive phase cancellations would translate to amplitude modulations which in turn produce sideband frequencies. A phase-coherent overlap is achieved when the skipped or duplicated section matches the periodicity of the input signal. It may be one period, two periods, or more. To be precise: n periods, where n is a positive integer. When a sound is perfectly periodic, an overlap would not even be necessary. But acoustic sounds are hardly ever so regular that sections can be spliced without a short crossfade. The exact shape of the crossfade function is not too critical, as long as the fade out and fade in add up to unity gain. I use linear slopes, both for illustration and in the implementation.

Imagine a periodic wave like the one below where we want to skip one period, jumping from the red area to the green area:

delay buffer

Instead of hard-cutting the wave, we apply a fade out at the jump-from point (a), and a fade in at the jump-to point (b), like so:

delay buffer

Fade out and fade in segments are overlapped in the output: temporarily there are two read pointers reading from different positions in the buffer. The signal size is thus reduced by one period:

output with fewer periods than original
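
The skip with linear crossfade can be tried offline on a test sinewave. The Python sketch below assumes that the fade out starts at the jump-from point and the fade in starts at the jump-to point. It works on a complete list of samples rather than on a real time delay line. `skip_segment` and its parameters are hypothetical names.

```python
import math


def skip_segment(signal, jump_from, jump_length, fade_length):
    """Remove `jump_length` samples at `jump_from`, bridged by a linear crossfade."""
    out = list(signal[:jump_from])
    for i in range(fade_length):
        fade_in = i / fade_length      # rises from 0 towards 1
        fade_out = 1.0 - fade_in       # the two gains always sum to 1
        old = signal[jump_from + i]                # pointer that fades out
        new = signal[jump_from + jump_length + i]  # pointer that fades in
        out.append(fade_out * old + fade_in * new)
    out.extend(signal[jump_from + jump_length + fade_length:])
    return out


period = 50
wave = [math.sin(2 * math.pi * n / period) for n in range(400)]

# Skipping exactly one period leaves the sinewave intact, only shorter.
coherent = skip_segment(wave, jump_from=120, jump_length=period, fade_length=period)

# Skipping half a period puts the two pointers in antiphase during the fade.
mismatch = skip_segment(wave, jump_from=120, jump_length=period // 2, fade_length=period)
print(len(coherent), abs(mismatch[145]))
```

In the mismatched case the output dips to almost zero halfway through the crossfade. Repeated at every jump, such dips are the amplitude modulation of the 'dumb' pitch shifter.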

The sum signal is read at lower speed to get the original size but with lower pitch:

output as rendered at lower pitch

Depending on pitch factor, portions of the wave may be played without overlap. With pitch factor 0.8 for example, four of five periods are played back, and no more than one of these four periods needs to be a crossfade region. This is a difference with the 'dumb' pitch shifter which uses crossfade overlaps all the time, no matter what pitch factor.
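
The arithmetic behind the 0.8 example can be generalized with a little bookkeeping. The Python sketch below assumes that exactly one period is skipped per jump and that the crossfade lasts one period. Both are simplifications for illustration, and `playback_budget` is a made-up name.

```python
from fractions import Fraction


def playback_budget(pitch_factor):
    """For a pitch factor below unity, return per skipped period:
    (input periods consumed, periods played back, share spent in crossfade)."""
    f = Fraction(pitch_factor).limit_denominator(1000)
    if not 0 < f < 1:
        raise ValueError("pitch factor must be between 0 and 1")
    input_periods = 1 / (1 - f)    # one of these is skipped
    played_periods = f / (1 - f)   # the rest is played back
    crossfade_share = min(Fraction(1), 1 / played_periods)
    return input_periods, played_periods, crossfade_share


print(playback_budget(0.8))  # 5 input periods, 4 played, a quarter in crossfade
print(playback_budget(0.5))  # 2 input periods, 1 played, crossfading all the time
```

Under these assumptions the crossfade-free portion shrinks to nothing at pitch factor 0.5.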

For pitch factors above unity, signal segments are not skipped but duplicated. The method with fade out and fade in remains similar.

latency vs overlap quality

In order to keep latency to a minimum number of samples, the read pointer should always be as close to the input write pointer as possible. But how close can it be allowed to get? This depends on our demands. If we want to secure the best matching overlap in all cases, even for periods of different length or non-periodic sound, the optimal splice point should be found by cross correlation. This needs a latency of more than two periods of the lowest expected frequency, because all samples of the two compared sections must be in the buffer. For most instruments this would lead to a few dozen milliseconds latency. However, we could relax our demands to an optimal overlap for continuous periodic sound only, because that is where it matters most. In that case the splice positions can be derived from up-to-date pitch information, and the read pointer may be allowed to get very near to the input write pointer. This approach is much more flexible. Latency will be variable, with an average below 10 milliseconds. As a bonus, pitch factor modulation can be done without causing discontinuities. I prefer this technique for real time pitch shifting, because the advantages outweigh the small quality compromise. In the following sections, pitch detection is the basis for splicing.

crossfade scheduling

For pitch transposition factors below unity, the read pointer is allowed to jump forward when it can skip a full period. In the picture below, if the read pointer is at point a and the input write pointer is at point b, the read pointer may jump from a to b. At that moment there is no information about the signal beyond point b, where the fade in should start. The fade in is performed over 'future samples'. Periodicity is assumed to be continuous, and if it is not, a phase-coherent overlap will not be possible anyhow. Since the crossfade cannot be performed right away in the buffer, it is scheduled for the future. Crossfade length as expressed in number of output samples depends not only on period length, but also on pitch transposition factor. This factor may be changed by the user while a crossfade is going on. Therefore the crossfade scheduler keeps track of its state in terms of input crossfade length, but increments its phase according to the actual pitch transposition. This method enables continuous manipulation or modulation of the transposition factor.

delay buffer
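
The scheduling idea can be sketched as a small state machine. This Python fragment is a hypothetical illustration and not the [solad~] source. It assumes a linear fade and one call to `tick` per output sample.

```python
class CrossfadeScheduler:
    """Tracks a crossfade whose length is given in input samples."""

    def __init__(self):
        self.length = 0.0   # crossfade length in input samples
        self.phase = 0.0    # progress, also counted in input samples
        self.active = False

    def start(self, length_in_input_samples):
        self.length = float(length_in_input_samples)
        self.phase = 0.0
        self.active = self.length > 0.0

    def tick(self, pitch_factor):
        """Advance by one output sample; return (fade_out_gain, fade_in_gain)."""
        if not self.active:
            return 0.0, 1.0
        fade_in = self.phase / self.length
        self.phase += pitch_factor  # the read pointers move this far per output sample
        if self.phase >= self.length:
            self.active = False
        return 1.0 - fade_in, fade_in


def ticks_until_done(scheduler, factors):
    """Count output samples until the crossfade has finished."""
    count = 0
    for factor in factors:
        if not scheduler.active:
            break
        scheduler.tick(factor)
        count += 1
    return count


fader = CrossfadeScheduler()
fader.start(100)
print(ticks_until_done(fader, [0.5] * 100 + [1.0] * 1000))
```

Because the phase advances by the current transposition factor, changing that factor halfway only changes how many output samples the remaining part of the fade takes.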

For pitch transposition factors above unity, the read pointer must jump backward when it approaches the input pointer too closely. Since the reading speed is faster than the input sample rate, some samples for the fade out (but not all of them) must already be in the buffer, otherwise we will run out of input samples before the crossfade is completed. The buffered part must be at least a fraction 1 - 1 / transposition factor of the fade out; the rest can be future samples. For example, in the case of transposition factor 1.5, the buffered and future parts of the fade out relate as follows:

buffered part: 1 - 1 / 1.5 = 1 / 3
future part: 1 / 1.5 = 2 / 3
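
The same split for arbitrary transposition factors, as a minimal Python sketch. The function name `fade_out_split` is illustrative.

```python
def fade_out_split(transposition_factor):
    """Return (buffered, future) fractions of the fade out length.

    While the fade out is read at `transposition_factor` times the input
    speed, new input keeps arriving at unity speed. The fraction
    1 / transposition_factor can therefore still arrive during the fade,
    and the remainder must be in the buffer when the fade starts.
    """
    if transposition_factor < 1.0:
        raise ValueError("only meaningful for transposition factors of 1 or more")
    future = 1.0 / transposition_factor
    buffered = 1.0 - future
    return buffered, future


for factor in (1.0, 1.5, 2.0, 4.0):
    print(factor, fade_out_split(factor))
```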

The higher the transposition factor, the bigger the buffered part of the fade out must be. Obviously, for unity pitch there will be no read pointer jumps, and no distance between read pointer and write pointer is required. If the user suddenly changes the pitch transposition from unity pitch to a high transposition factor, no fade out samples are available in the buffer. Also, if the transposition factor is suddenly increased while a crossfade is still playing, the buffer may not contain enough samples for the fade out. A discontinuity could be audible in such cases. In my experience, this is an acceptable penalty for the case of rough parameter manipulation. The alternative would be a fixed buffered length to be on the safe side in all cases, adding latency even when it is not necessary.

crossfade length

In the illustrations so far I have sketched a crossfade overlap which has exactly the size of the pointer jump length, which in turn was one period every time. In practice these lengths can differ. There may be a reason to jump over more periods, for example when the detected input pitch is suddenly much higher and the read pointer lag is more than one period. Apart from this, the crossfade overlap length may be different from the pointer jump length. High pitch transposition factors induce many signal duplicates which are only wanted in the case of periodic sound. Transient duplicates are undesired. When the lengths of fade in and fade out are reduced, transient duplicates are statistically reduced too, which slightly suppresses the echo-like character of the output. It also means shorter latency for transposition factors above unity, because fewer samples need to be buffered for the fade out. For those cases I have settled on a crossfade length of pointer jump length divided by transposition factor. For transposition factors below unity, the best crossfade length is one period. For transposition factors below 0.5 this is too long though, because the read pointer will want to jump forward after less than one input period and the crossfade would be interrupted. For these cases the crossfade length is defined as pointer jump length * 2 * transposition factor. For example, transposition factor 0.25 will have crossfade overlaps of 2 * 0.25 = 0.5 times the pointer jump length. This means that transients may be skipped, which adds to the lack of articulation in a far down transposed signal.
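
The three rules above fit in one small function. This Python sketch expresses all lengths in input samples. Treating factors from 0.5 up to and including unity as the 'one period' case is my reading of the text, and `crossfade_length` is an illustrative name.

```python
def crossfade_length(period, jump_length, transposition_factor):
    """Crossfade length in input samples, following the three rules in the text."""
    if transposition_factor <= 0.0:
        raise ValueError("transposition factor must be positive")
    if transposition_factor > 1.0:
        # Shorter fades: fewer transient duplicates, less buffering.
        return jump_length / transposition_factor
    if transposition_factor < 0.5:
        # One period would be interrupted by the next forward jump.
        return jump_length * 2.0 * transposition_factor
    return float(period)


print(crossfade_length(100, 100, 2.0))   # 50.0
print(crossfade_length(100, 200, 0.8))   # 100.0
print(crossfade_length(100, 100, 0.25))  # 50.0
```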

Further compromises may arise in cases where the input pitch suddenly changes and the number of buffered samples is not sufficient for the desired crossfade length. The crossfade is then set to match the available number of samples. Most importantly, no read pointer should ever pass the write pointer position, because it would then read stale samples from a previous iteration through the circular buffer.

pitch detection latency

The main cause of artifacts in the pitch shifter method described above is latency in the pitch detection. The onset of a note with new pitch will not be processed properly because the pitch detector does blockwise analysis and will always be a bit late with pitch reports. Moreover, it cannot identify the exact point of pitch change. Fortunately this latency will only influence splice point positions and not the transposition itself. Correct splice points are most important in the case of pure continuous sounds, and these will not be spoiled by pitch detector latency. The effects of pitch detection latency become noticeable when a periodic sound starts abruptly and without a transient phase. They can be perfectly demonstrated with an electronic test signal, a sinewave of abruptly changing frequency. The first periods of the new frequency are spliced with the crossfade length of the preceding frequency, and it may look like this:

The length spoiled by mis-detection depends on the analysis frame size and overlap of the pitch detection routine. In general it is a few dozen milliseconds. It looks like a series of discontinuities but it is amplitude modulation, producing sideband frequencies, and that is what you hear: no clicks or pops. In this test signal the artifact is clearly perceptible when compared to the original sound, but with acoustic instruments such sudden pitch changes in an otherwise pure sound rarely happen. In practice the artifact is less troublesome than it looks here.

transient detection

Perception of latency is mainly based on delayed transients. Latency in a continuous signal is not noticeable. Therefore it is most important to reduce latency to a minimum at the moments when transients happen. A sharp transient characterized by a sudden amplitude rise is easy to detect. The signal content directly preceding a transient is normally not significant, so the read pointer can be moved to the start of the transient. And because sharp transients are short, the read pointer lag will be small too, which results in low perceived latency. This is an extra trick which can be applied on top of the techniques described earlier for skipping / duplicating periods. A pitch detector cannot detect the high frequencies in transients as fast as an amplitude based transient detector can. Transient detection is particularly useful in situations where transients follow upon low frequency content, because long periods cause long pitch shifter latency on average.
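
The text does not specify how transients are detected. As a rough illustration only, the Python sketch below flags a block of samples when its peak amplitude is several times higher than the peak of the previous block. The block size, ratio and noise floor are arbitrary assumptions, and `detect_transients` is a made-up name.

```python
def detect_transients(samples, block=64, ratio=4.0, floor=1e-4):
    """Return start indices of blocks whose peak rises sharply over the previous block."""
    onsets = []
    previous_peak = 0.0
    for start in range(0, len(samples) - block + 1, block):
        peak = max(abs(s) for s in samples[start:start + block])
        # Compare against the previous block, but never against less than the floor.
        reference = max(previous_peak, floor)
        if peak > floor and peak > ratio * reference:
            onsets.append(start)
        previous_peak = peak
    return onsets


# Silence followed by a sudden loud segment.
test_signal = [0.0] * 256 + [0.5] * 256
print(detect_transients(test_signal))  # prints [256]
```

A read pointer could then be moved to a reported onset position. The resolution of this sketch is limited to the block size.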

[solad~], pitch shift external for Pure Data

The techniques described above are implemented in a Pure Data class called [solad~] (for Synchronized Overlap Add Delayline). It is in fact a delay line class, specialized for pitch shifting. It forms a combo with class [helmholtz~], which supplies the pitch detection info. Class [solad~] is still in testing phase. I have not decided yet how to distribute the stuff, but here's a temporary download of my externals library with [solad~], [helmholtz~] source and binaries plus many test patches included:

http://www.katjaas.nl/tmp/PdClassDev.zip

Feel free to download and test this stuff, but do not redistribute or deeplink, because it's a temporary version and URL. Stay tuned for the release on this page if you're interested.
