My previous page on pitch shifting illustrated issues of a 'dumb'
time domain pitch shifter using two delay lines. Echoes and
modulations spoil the sound, which is why such a pitch shifter is
hardly of practical use. I revisited the method many times though,
because its response felt snappier than other methods like WSOLA and
the phase vocoder. This page illustrates how a dumb pitch shifter gets
a bit smarter through pitch detection while achieving an average
latency of only a few milliseconds. The result sounds much like
Digitech's Whammy pedal: no formant preservation, but fast response,
and pretty robust with monophonic input.
Let me briefly summarize how the basic time domain pitch shifter
works. Two playback read pointers move alternately at non-unity
speed through a circular buffer of input samples. An automatic
crossover function (window) masks discontinuities occurring at read
pointer
jumps between maximum delay and zero delay. When the two output signals
are in phase, the result is optimal. But this is mostly not the case.
With the outputs out of phase, periodic cancellation will lead to very
audible amplitude modulation in periodic sound. Another adverse effect
of output overlap
is the audible duplication of transient sounds. The picture below shows
a simple pitch shifter in Pure Data, with a sinewave input and
amplitude modulation in the pitch shifted output.
![]() amplitude modulation in a pitch shifted sinusoid |
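The Pd patch itself is not reproduced here. As a rough Python sketch of the same principle, the code below sweeps two read taps, half a cycle apart, through a delay of 0 to `window` samples. The sin² crossover window, the linear interpolation, the 480 Hz test sine at 48 kHz, and all function names are assumptions made for illustration. A window of two input periods puts the taps one period apart (in phase); a window of three periods puts them one and a half periods apart (out of phase).

```python
import math

def dumb_pitch_shift(x, ratio, window):
    """Two read taps, half a cycle apart, sweep a delay of 0..window samples."""
    out = []
    phase = 0.0
    step = (1.0 - ratio) / window       # delay sweep per sample, as fraction of window
    last = len(x) - 1
    for n in range(len(x)):
        y = 0.0
        for offset in (0.0, 0.5):
            ph = (phase + offset) % 1.0
            pos = n - ph * window       # fractional read position
            if pos < 0.0:
                continue                # buffer not filled yet
            i = int(pos)
            frac = pos - i
            j = min(i + 1, last)
            sample = x[i] + frac * (x[j] - x[i])        # linear interpolation
            y += math.sin(math.pi * ph) ** 2 * sample   # sin^2 windows sum to 1
        out.append(y)
        phase = (phase + step) % 1.0
    return out

def modulation_depth(y, block, start):
    """1 - (smallest block peak / largest block peak); 0 means a steady level."""
    peaks = [max(abs(v) for v in y[k:k + block])
             for k in range(start, len(y) - block + 1, block)]
    return 1.0 - min(peaks) / max(peaks)

period = 100                            # 480 Hz at a 48 kHz sample rate
x = [math.sin(2.0 * math.pi * n / period) for n in range(6000)]
in_phase = dumb_pitch_shift(x, 0.8, 2 * period)
out_of_phase = dumb_pitch_shift(x, 0.8, 3 * period)

print(modulation_depth(in_phase, 125, 600))      # close to 0: steady level
print(modulation_depth(out_of_phase, 125, 600))  # large: deep amplitude modulation
```

The block size of 125 samples is one period of the shifted output (384 Hz), so each block peak is a fair estimate of the momentary output level.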
Playing around with my Pure Data patches, I found that a perfect
non-modulated output could be produced by matching parameter
settings to the frequency of an input sinewave. Basically, a maximum
delay of two times the sinewave's period length seemed to give good
result:
maximum delay in milliseconds = 2 * (1 / frequency) * 1000
= 2000 / frequency
This triggered the idea to use pitch tracker [helmholtz~]
and synchronize delay line parameters to the input pitch. A
snapshot of the Pd patch below illustrates how the output sinusoid is
no longer amplitude modulated:
![]() |
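The synchronization rule (maximum delay = 2000 / frequency) is simple enough to write down directly. In this small sketch, the function name and the `periods` parameter are made up for illustration; the detected frequency would come from the pitch tracker.

```python
def max_delay_ms(frequency_hz, periods=2):
    """Maximum delay in milliseconds spanning `periods` periods of the input."""
    if frequency_hz <= 0.0:
        raise ValueError("frequency must be positive")
    return periods * 1000.0 / frequency_hz

print(max_delay_ms(100.0))   # 20.0
print(max_delay_ms(440.0))   # about 4.55
```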
The experiment was encouraging, but it was not close to a practical
pitch shifter implementation. Every time the input frequency changed,
not only would the maximum delay time change abruptly, but the actual
delay time (being a phase between zero and maximum delay time) would
jump too. This of course caused discontinuities in the pitch shifted
output. The idea of pitch synchronization worked, but it had to be
implemented in a different way.
Admittedly, I first searched the internet for a solution.
The wildest proposals have been published on the
topic of low latency pitch shifting. What put me on the right track was
an article by
Azadeh Haghparast et al.
It describes how a single delay line is read with variable speed, while
splice points for the read pointer are found by cross correlating the
candidate overlap areas. The article is illuminating because it gives
comprehensive definitions and
illustrations of the problem at hand. In the end, I implemented a
similar approach, but based on pitch detection instead of cross
correlation.
The basic idea of a delay line pitch shifter is to write input data
at a constant speed, while copying that data to the output at a
different constant speed. Since the audio card input and output run at
the same sample rate, the speed difference is simulated by
reading 'in between the samples', or by skipping samples. Obviously, it
doesn't take long before the distance between input
write pointer and output copy pointer becomes problematic. When copy
pointer speed is slow (for lowered pitch), it will soon lag too far
behind to reflect real time input events. On the other hand when copy
pointer speed is fast (for higher pitch), it will quickly pass the
input write pointer and read stale data from the previous iteration
through the circular buffer. Therefore the copy pointer has to jump
backward or forward sometimes.
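To make the pointer drift concrete, here is a minimal sketch of a circular buffer that is written at unit speed and read 'in between the samples' with linear interpolation. The class and function names are hypothetical, and a ramp is used as input so that the value read equals the read position.

```python
class DelayLine:
    """Circular buffer addressed by absolute (ever increasing) sample positions."""

    def __init__(self, size):
        self.buf = [0.0] * size
        self.size = size
        self.write_pos = 0              # total number of samples written

    def write(self, sample):
        self.buf[self.write_pos % self.size] = sample
        self.write_pos += 1

    def read(self, pos):
        """Linear interpolation at fractional position `pos`."""
        i = int(pos)
        frac = pos - i
        a = self.buf[i % self.size]
        b = self.buf[(i + 1) % self.size]
        return a + frac * (b - a)

def run(speed, ticks, size=64):
    """Write a ramp, read it at `speed`; return (last output, final lag in samples)."""
    line = DelayLine(size)
    read_pos = 0.0
    out = 0.0
    for n in range(ticks):
        line.write(float(n))
        out = line.read(read_pos)
        read_pos += speed
    return out, line.write_pos - read_pos

print(run(0.5, 40))   # (19.5, 20.0)
```

At speed 0.5 the lag grows by half a sample per tick. At a speed above 1 it would shrink by (speed - 1) per tick, so the read pointer catches up with the write pointer unless it jumps back in time.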
The picture below shows the case of a copy pointer slowed down to
speed 0.5, which has to jump forward now and then. The read pointer's
forward jump is equivalent to cutting away a segment of the signal.
![]() reading in between samples, jumping forward |
The pointer jump represents a discontinuity which must be smoothed
by some cross fade functionality. The fade out could start at the
jump-from point while the fade in ends at the jump-to point. The
pointer jump is thus bridged by a cross fade region where the overlap
of fade out and fade in curve should sum to 1. The picture below
is again the case of a slowed down reading speed, but now with
overlapping fade out and fade in samples.
![]() red lines: fade out samples, green lines: fade in samples |
If reading speed and pointer jump distance were always the same,
fade out samples could be written in advance to the delay line and
mixed with the fade in samples later. There is no DSP law prohibiting
us from writing samples to 'future positions' in a delay line, although
it is not commonly done. This serves as a conceptual illustration
rather than a practical approach. In practice, the system is more
flexible when samples are written in their natural order to the delay
line, and copied to their overlap position 'on demand', according to a
reading speed and pointer jump length which may vary over time.
![]() red lines: fade out samples, green lines: fade in samples |
When copying at increased speed, the copy pointer must jump backward
now and then, like in the proverbial Echternach Procession. This is
equivalent to creating duplicates of signal
segments. In the illustration below, every other sample is skipped but
the remaining ones are all copied twice.
![]() skipping samples, jumping backward |
And here is the version with fade out and fade in samples drawn:
![]() red lines: fade out samples, green lines: fade in samples |
For periodic sound, phase-coherent overlap is important
because
repetitive phase cancellations would translate to amplitude modulations
which in turn produce sideband frequencies. A phase-coherent overlap is
achieved when the skipped or duplicated section matches periodicity in
the input signal. It may be one period, two periods, or more. To be
precise: 2n periods,
where n is an integer. When a sound is perfectly periodic, an overlap
would not even be necessary. But acoustic sounds are hardly ever so
regular that sections can be spliced without a short crossfade. The
exact shape of the crossfade
function is not too critical, as long as the fade out and fade in add
up to unity gain. I use linear slopes, both for illustration and in the
implementation.
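The effect of a phase-coherent splice can be checked offline. The sketch below is not the real time implementation: it splices a stored sine with a linear crossfade, once skipping a full period, once repeating a full period, and once skipping only half a period. The function names and test values are assumptions for illustration.

```python
import math

def splice(x, jump_from, jump_len, fade_len):
    """Jump the read position by jump_len samples at jump_from (positive skips,
    negative repeats), bridging the jump with a linear crossfade."""
    out = list(x[:jump_from])
    for k in range(fade_len):
        fade_in = k / fade_len          # rises from 0 towards 1
        fade_out = 1.0 - fade_in        # falls from 1 towards 0, gains sum to 1
        old = x[jump_from + k]
        new = x[jump_from + jump_len + k]
        out.append(fade_out * old + fade_in * new)
    out.extend(x[jump_from + jump_len + fade_len:])
    return out

period = 50
x = [math.sin(2.0 * math.pi * n / period) for n in range(400)]

def error(y):
    """Largest deviation from an uninterrupted sine of the same period."""
    return max(abs(v - math.sin(2.0 * math.pi * n / period))
               for n, v in enumerate(y))

skipped = splice(x, 100, period, period)         # one period shorter
repeated = splice(x, 100, -period, period)       # one period longer
mismatch = splice(x, 100, period // 2, period)   # jump of half a period

print(len(skipped), len(repeated))               # 350 450
print(error(skipped), error(repeated))           # both negligible
print(max(abs(v) for v in mismatch[120:131]))    # level dips in mid crossfade
```

With a jump of exactly one period the spliced signal is indistinguishable from the original sine. With a jump of half a period the two crossfade partners are in antiphase and the level collapses in the middle of the overlap, which is the amplitude modulation described above.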
Imagine a periodic wave like the one below where we want to skip one
period, jumping from the red area to the green area:
![]() delay buffer |
Instead of hard-cutting the wave, we apply a fade out at the
jump-from point (a), and a fade in at the jump-to point (b), like so:
![]() delay buffer |
Fade out and fade in segments are overlapped in the output:
temporarily there are two read pointers reading from different
positions in the buffer. The signal size is thus reduced by one period:
![]() output with fewer periods than original |
The sum signal is read at lower speed to get the original size but with
lower pitch:
![]() output as rendered at lower pitch |
Depending on pitch factor, portions of the wave may be played
without overlap. With pitch factor 0.8 for example, four of five
periods are played back, and no more than one of these four periods
needs to be a crossfade region. This differs from the 'dumb' pitch
shifter, which uses crossfade overlaps all the time, no matter what the
pitch factor.
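The arithmetic behind the 0.8 example can be checked with a few lines. The sketch assumes, as in the illustrations, that every jump skips exactly one period and that the crossfade is one period long; `skip_cycle` is a name made up here.

```python
from fractions import Fraction

def skip_cycle(pitch_factor):
    """For a factor below 1, return (input periods per skipped period,
    periods played back, share of the played periods spent in crossfade)."""
    p = Fraction(pitch_factor).limit_denominator(1000)
    if not 0 < p < 1:
        raise ValueError("pitch factor must be between 0 and 1")
    consumed = 1 / (1 - p)              # input periods between two jumps
    played = consumed - 1               # one of them is skipped
    share = min(Fraction(1), 1 / played)
    return consumed, played, share

print(*skip_cycle(0.8))   # 5 4 1/4
print(*skip_cycle(0.5))   # 2 1 1
```

At factor 0.5 the crossfade share reaches 1, meaning a one-period crossfade leaves no room for plain playback.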
For pitch factors above unity, signal segments are not skipped
but duplicated. The method with fade out and fade in remains similar.
In order to keep latency to a minimum number of samples, the read
pointer should always be as close to the input write pointer as
possible. But how close can it be allowed to get? This depends on our
demands. If we want to secure the best matching overlap in all cases,
even for periods of different length or non-periodic sound, the optimal
splice point should be found by cross correlation. This needs a latency
of more than two periods of the lowest expected frequency, because all
samples of the two compared sections must be in the buffer. For most
instruments this would lead to a few dozen milliseconds of
latency. However, we could relax our demands to an optimal overlap for
continuous periodic sound only, because that is where it matters most.
In that case the splice positions can be derived from up to date pitch
information, and the read pointer may be allowed to get very near to
the input write pointer. This approach is much more flexible. Latency
will be variable, with an average below 10 milliseconds. As a bonus,
pitch factor modulation can be done without causing discontinuities. I
prefer this technique for real time pitch shifting, because the
advantages outweigh the small quality compromise. In the following
sections, pitch detection is the basis for splicing.
For pitch transposition factors below unity, the read pointer is
allowed to jump
forward when it can skip a full period. In the picture below, if
the read pointer is at point a and the input write pointer is at point
b, the read pointer may jump from a to b. At that moment there is no
information about the signal beyond point b, where the fade in should
start. The fade in is performed over 'future samples'. Periodicity is
assumed to be continuous, and if not, a phase-coherent overlap will not
be possible anyhow. Since the crossfade can not be performed right away
in the buffer, it is scheduled for the future. Crossfade length as
expressed in number of output samples depends not only on period
length, but also on pitch transposition factor. This factor may be
changed by the user while a crossfade is going on. Therefore the
crossfade scheduler keeps track of its state in terms of input
crossfade length, but increments its phase according to the actual
pitch transposition. This method enables
continuous manipulation or modulation of the transposition factor.
![]() delay buffer |
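One way to picture such a scheduler is sketched below. It is a guess at the principle, not the [solad~] code: the crossfade length is stored in input samples, and every output sample advances the position by the transposition factor that is valid at that moment. All names are hypothetical.

```python
class CrossfadeScheduler:
    """Crossfade with a length in input samples, advanced per output sample."""

    def __init__(self):
        self.length = 0.0
        self.position = 0.0
        self.active = False

    def start(self, length_in_input_samples):
        self.length = float(length_in_input_samples)
        self.position = 0.0
        self.active = self.length > 0.0

    def tick(self, transposition):
        """Return (fade_out_gain, fade_in_gain) for this output sample, then advance."""
        if not self.active:
            return 0.0, 1.0
        fade_in = self.position / self.length
        self.position += transposition  # input samples covered by this output sample
        if self.position >= self.length:
            self.active = False
        return 1.0 - fade_in, fade_in

def count_ticks(length, factor_at):
    """Number of output samples a crossfade takes when the factor follows factor_at(n)."""
    sched = CrossfadeScheduler()
    sched.start(length)
    ticks = 0
    while sched.active:
        sched.tick(factor_at(ticks))
        ticks += 1
    return ticks

print(count_ticks(100, lambda n: 0.5))                      # 200
print(count_ticks(100, lambda n: 0.5 if n < 100 else 1.0))  # 150
```

In the second call the factor changes halfway through the crossfade. The gains simply continue from the current position, so the change causes no jump in the fade curves.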
For pitch transposition factors above unity, the read pointer must
jump backward when it approaches the input pointer too closely. Since
the reading speed is faster than input sample rate, some samples for
the fade out (but not all of them) must already be in the buffer,
otherwise we will run out of input samples before the crossfade is
completed. The buffered fraction of the fade out must be at least
1 - 1 / transposition factor; the rest can be future samples. For
example, in the case of transposition factor 1.5, the buffered and
future parts of the fade out relate as follows:
buffered part: 1 - 1 / 1.5 = 1 / 3
future part: 1 / 1.5 = 2 / 3
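The reasoning behind this split: a fade out of L input samples read at factor t takes L / t output samples, and in that time only L / t new input samples arrive, so the remaining L - L / t samples must already be in the buffer. A small sketch with hypothetical function names:

```python
import math

def fade_out_split(transposition):
    """Return (buffered, future) fractions of the fade out for a factor >= 1."""
    if transposition < 1.0:
        raise ValueError("this split applies to factors of 1 and above")
    future = 1.0 / transposition
    return 1.0 - future, future

def buffered_samples(fade_len, transposition):
    """Whole input samples that must be in the buffer when the crossfade starts."""
    if transposition < 1.0:
        raise ValueError("this split applies to factors of 1 and above")
    return math.ceil(fade_len - fade_len / transposition)

print(fade_out_split(1.5))          # roughly (0.333, 0.667)
print(buffered_samples(300, 1.5))   # 100
print(buffered_samples(300, 1.0))   # 0
```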
The higher the transposition factor, the larger the buffered part of
the fade out must be. Obviously, for unity pitch there will be no read
pointer jumps and a distance between read pointer and write pointer is
not required. If the user suddenly changes the pitch transposition from
unity pitch
to a high transposition factor, no fade out samples are available in
the buffer. Also, if the transposition factor is
suddenly increased while a crossfade is still playing, the buffer may
not contain enough samples for the fade out. A discontinuity could be
audible in such cases. In my
experience, this is an acceptable penalty for the case of rough
parameter manipulation. The alternative would be a fixed buffered
length to be on the safe side in all cases, adding latency even when it
is not necessary.
In the illustrations so far I have sketched a crossfade overlap which
has exactly the size of the pointer jump length, which in turn was one
period every time. In practice these lengths can differ. There may be
a reason to jump over more periods, for example when the detected input
pitch is
suddenly much higher and the read pointer lag is more than one period.
Apart from this, the crossfade overlap length may be different from the
pointer jump length. High pitch transposition factors induce many
signal duplicates which are only wanted in the case of periodic sound.
Transient duplicates are undesired. When the lengths of fade in and
fade out are reduced, transient duplicates are statistically reduced
too, which slightly suppresses the echo-like character of the output.
It also means shorter latency for transposition factors above unity,
because fewer samples need to be buffered for the fade out. I have
settled on pointer jump length divided by transposition factor for the
crossfade of those cases. For transposition factors below unity, the
best crossfade length is one period. For transposition factors below
0.5 this is too long though, because the read pointer will want to jump
forward after less than one input period and the crossfade would be
interrupted. For these cases the crossfade length is defined as pointer
jump length * 2 * transposition factor. For example, transposition
factor 0.25 will have crossfade overlaps of 2 * 0.25 = 0.5 times the
pointer jump length. This means that transients may be skipped, which
adds to the lack of articulation in a far down transposed signal.
Further compromises may arise in cases where the input pitch
suddenly changes and the amount of buffered samples is not sufficient
for the desired crossfade length. The crossfade is then set to match
the available number of samples. Most importantly, no read
pointer should ever pass the write pointer position, because it would
then read stale samples from a previous iteration in the circular
buffer.
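The crossfade length rules described above can be collected in one small function. This is a sketch of the rules as stated in the text, with made-up names; lengths are in input samples, and `available` stands for the number of samples that can actually be used for the crossfade.

```python
def crossfade_length(period, jump_len, transposition, available=None):
    """Crossfade length in input samples for a given pointer jump."""
    if transposition > 1.0:
        length = jump_len / transposition          # shorter fades, fewer duplicates
    elif transposition >= 0.5:
        length = period                            # one period is the best length
    else:
        length = jump_len * 2.0 * transposition    # avoid interrupted crossfades
    if available is not None:
        length = min(length, available)            # never ask for more than exists
    return length

print(crossfade_length(100, 100, 2.0))                 # 50.0
print(crossfade_length(100, 100, 0.8))                 # 100
print(crossfade_length(100, 100, 0.25))                # 50.0
print(crossfade_length(100, 100, 0.8, available=60))   # 60
```

When the jump length equals one period, the three branches meet at factors 0.5 and 1, so the crossfade length changes smoothly while the factor is swept.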
The main cause of artifacts in the above described pitch shifter method
is latency in the pitch detection. The onset of a note with new pitch
will not be processed properly because the
pitch detector does blockwise analysis and will always be a bit late
with pitch reports. Moreover, it cannot identify the exact point of
pitch change. Fortunately this latency will only influence splice point
positions and not the transposition itself. Correct splice points are
most important in the case of pure continuous sounds, and these will
not be spoiled by pitch detector latency. The effects of pitch
detection latency become noticeable when a periodic sound starts
abruptly and without a transient phase. This can be demonstrated
clearly with an electronic test signal, a sinewave of abruptly
changing frequency. The first periods of the new frequency are spliced
with the crossfade length of the preceding frequency, and it may look
like this:
![]() |
The length spoiled by mis-detection depends on the analysis
frame size and overlap of the pitch detection routine. In general it is
a few dozen milliseconds. It looks like a series of discontinuities,
but it is amplitude modulation producing sideband frequencies, and that
is what you hear: no clicks or pops. In this test signal the artifact
is clearly perceptible when compared to the original sound, but with
acoustic
instruments such sudden pitch changes in an otherwise pure sound rarely
happen. In practice the artifact is less troublesome than it looks
here.
Perception of latency is mainly based on delayed transients. Latency in
a continuous signal is not noticeable. Therefore it is most important
to reduce latency to a minimum at the moments when transients happen. A
sharp transient characterized by a sudden amplitude rise is easy to
detect. The signal content directly preceding a transient is normally
not significant, so the read pointer can be moved to the start of the
transient. And because sharp transients are short, the read pointer lag
will be small too, which results in low perceived latency. This is an
extra trick which can be applied on top of the earlier described
techniques for skipping / duplicating periods. A pitch detector can not
detect the high frequencies in transients as fast as an amplitude based
transient detector can. The trick is particularly useful in situations
where transients follow low frequency content, because long
periods cause long pitch shifter latency on average.
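The text does not specify how the transient detector works, so the following is only one plausible amplitude based design: a peak follower with instant attack is compared against a slow average, and an onset is reported when the peak suddenly exceeds a multiple of that average. The thresholds, the follower coefficients, the hold-off time, and the function name are all assumptions.

```python
import math

def detect_onsets(x, ratio=4.0, floor=0.02, decay=0.99, slow=0.01, holdoff=1000):
    """Return sample indices where the level rises suddenly."""
    onsets = []
    peak = 0.0                          # instant attack, slow release
    average = 0.0                       # slow running average of the level
    last = -holdoff
    for n, s in enumerate(x):
        level = abs(s)
        peak = max(level, peak * decay)
        average += slow * (level - average)
        sudden = peak > floor and peak > ratio * average
        if sudden and n - last >= holdoff:
            onsets.append(n)
            last = n
    return onsets

# Test signal: a very quiet sine that abruptly becomes loud at sample 300.
x = [(0.005 if n < 300 else 0.8) * math.sin(2.0 * math.pi * n / 20)
     for n in range(1800)]
print(detect_onsets(x))   # one onset, right after sample 300
```

A pitch shifter could use each reported index as the new read pointer position, which discards the lag that built up before the transient.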
The above described techniques are implemented in a Pure Data class
called [solad~] (for Synchronized Overlap Add Delayline). It is in fact
a delay line class, specialized for pitch shifting. It forms a combo
with class [helmholtz~]
for pitch detection info. Class [solad~] is still in its testing phase.
I have not decided yet how to distribute it, but here's a temporary
download of my externals library with [solad~] and [helmholtz~] source
and binaries, plus many test patches:
http://www.katjaas.nl/tmp/PdClassDev.zip
Feel free to download and test it, but please do not redistribute
or deep-link, because this is a temporary version and URL. Stay tuned
for the release on this page if you're interested.