Epoch detection

Epochs are moments in time in voiced speech, often defined as the Glottal Closure Instants. Locating them consistently is necessary for some types of signal processing, such as Pitch-Synchronous Overlap-Add (PSOLA) methods.

Let's start by detecting epochs: most commonly called "pitch marking".
Here's the most obvious use for pitch marks: it's to do some pitch-synchronous, overlap-and-add signal processing.
Here, I've got a couple of candidate units I've chosen during unit selection.
I would like to concatenate these waveforms.
I would like to do that in a way that is least likely to be noticed by the listener.
So, take the two waveforms and we're going to try and find an alignment between them - by sliding one of them backwards and forwards - such that when we crossfade them (in other words, overlap-and-add them) it will look and sound as natural as possible.
If we move one of the waveforms side to side and observe its similarity to the top one, there will be a point where it looks very similar.
The very easiest way to find that is to place pitch marks on both signals.
So if we draw pitch marks on them - that's the pitch marks for the top signal, and pitch marks for the bottom signal - we can see that simply by lining up the pitch marks, we'll get a very good way of crossfading the two waveforms.
It will be pitch-synchronous.
Our overlap-and-add procedure will do the following:
It will apply a fade-out to the top waveform - some amplitude control where it's at full volume here, then the volume is turned down - and at the same time, for the bottom waveform, turn the volume up and fade it in.
Then we just add the two waveforms together.
That's pitch-synchronous overlap-and-add.
And if we've got pitch marks, it's very simple to implement that.
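To make that concrete, here's a rough sketch in Python (using numpy) of that pitch-synchronous cross-fade. The function name, the linear fade, and the choice of fade length are illustrative assumptions, not the exact recipe used by any particular system.

```python
import numpy as np

def concatenate_at_pitch_marks(wav_a, mark_a, wav_b, mark_b):
    """Join two candidate waveforms by cross-fading around one pitch mark in each.

    mark_a: sample index of the pitch mark in wav_a where the join happens
    mark_b: sample index of the corresponding pitch mark in wav_b
    """
    # Fade over a region centred on the aligned pitch marks,
    # limited by how close each mark is to the edge of its waveform.
    fade_len = min(mark_a, len(wav_a) - mark_a, mark_b, len(wav_b) - mark_b)

    fade_out = np.linspace(1.0, 0.0, 2 * fade_len)  # turn the top waveform down
    fade_in = 1.0 - fade_out                        # turn the bottom waveform up

    a_seg = wav_a[mark_a - fade_len : mark_a + fade_len] * fade_out
    b_seg = wav_b[mark_b - fade_len : mark_b + fade_len] * fade_in

    # Overlap-and-add: everything before the join from wav_a,
    # the cross-faded region, then everything after the join from wav_b.
    return np.concatenate([wav_a[: mark_a - fade_len],
                           a_seg + b_seg,
                           wav_b[mark_b + fade_len :]])
```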
So where do these pitch marks come from?
Well, one tempting thing we might do is to actually try and record the vocal fold activity directly.
That's what we used to do.
Before we got good at doing this from the waveform, we might put a device called the laryngograph on to the person speaking, and record in parallel on a separate audio channel the activity of the vocal folds.
That's this signal called Lx.
That signal's obviously much simpler than the speech signal.
It's closer to our idealised pulse train, and it's really fairly straightforward to find the epochs from this signal (not completely trivial, but fairly straightforward).
However, it's very inconvenient to place this device on speakers, especially when recording large speech databases.
For some speakers, it's hard to position: we don't get a very good recording of the Lx signal.
So we're not going to consider doing it from that signal.
We're going to do it directly from the speech waveform.
Let's develop an algorithm for epoch detection.
I'm going to keep this as simple as possible, but I'm going to point to the sort of things that, in practice, you would need to do to make it really good.
Our goal, then, is to find a single, consistent location within each pitch period of the speech waveform.
The key here is: consistent.
The reason for that should be obvious from our pitch-synchronous overlap-and-add example.
We want to know that we're overlapping waveforms at the same point in their cycle; for example, the instant of glottal closure.
The plan for developing this algorithm then is: we'll actually try and make the signal a bit more like the Lx signal.
In other words, to make the problem simpler by making the signal simpler.
We'll try and throw away everything except the fundamental.
We'll try and turn the signal into a very simple-looking signal.
Then we'll attempt to find the main peak in each period of that signal.
It will turn out that peak picking is actually a bit too hard to do reliably.
But we can flip the problem into one of finding zero crossings, and that's much easier.
So to make that clear then in this example waveform here: we're looking for a single consistent point within each pitch period.
Here, the obvious example would be this main peak.
We're looking for these points here in the time domain, and the algorithm is going to work in the time domain.
We're not going to transform this signal.
We're just going to filter it and then do peak picking, via zero crossings.
To understand how to simplify the signal to make it look easier to do peak picking on, we could examine it in the frequency domain.
Here's the spectrum of a vowel sound.
If we zoom in, we'll see the harmonics very clearly.
This is the fundamental.
That's what we are looking for, but we're not looking for its frequency, we're looking for the location of the individual epochs in the time domain.
The reason the waveform looks complicated is that there's energy at all of these other frequencies as well, mixed in, and they're weighted by the spectral envelope (or the formants).
All of this energy here is making the waveform more complex and is making it harder to find the pitch periods.
So we'll simplify the signal, and we'll try and throw away everything except the fundamental.
If we do that, it will look a bit like a sine wave: a bit like a pure tone.
How do we throw away everything except the fundamental?
Well, just apply a low-pass filter: a filter whose response looks something like this.
Its response is 1 up to some frequency; in other words, it just lets all those frequencies through, multiplied by 1.
Then it has some cutoff, and then it cuts away all of these frequencies.
And so this part here is passed through the filter.
It's called the pass band.
All of this stuff is rejected by the filter.
In other words, its amplitude is reduced down to zero.
I should point out that a perfect-looking filter like this is impossible in reality.
Real filters might look a bit more like this; they have some slope.
Nevertheless, we can apply a low-pass filter to get rid of almost all the energy except for the fundamental.
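As a concrete illustration, here's one way that low-pass filtering step might be done in Python with scipy. The Butterworth design, the filter order and the 150 Hz cutoff are all illustrative choices; a real system would choose the cutoff to suit the speaker's F0 range.

```python
from scipy.signal import butter, filtfilt

def lowpass_around_f0(speech, sample_rate, cutoff_hz=150.0, order=4):
    # Design a low-pass filter whose pass band keeps the fundamental
    # and rejects the higher harmonics (and hence the formants).
    b, a = butter(order, cutoff_hz, btype="low", fs=sample_rate)
    # filtfilt runs the filter forwards then backwards, so the output
    # is not shifted in time relative to the input (zero phase delay).
    return filtfilt(b, a, speech)
```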
So if we low-pass filter speech, we'll get a signal that looks a little bit like this.
It's almost a sine wave, except it varies in amplitude.
The frequency of this sine wave is F0, and it's now looking much easier to find the main peaks: these peaks here.
Direct peak picking is a little bit hard.
Let's think about a naive way that we might do that.
We might set some threshold like this, and every time the signal goes above it, we'll find a peak.
But if the signal's amplitude drops a lot, we might not hit that threshold.
So, we might miss some peaks.
If we set the threshold very low, we might start picking up crossings of the threshold where there's just a bit of noise in the signal.
This is a bit of unvoiced speech where there happened to be some low frequencies around F0 that got through, but these are not the peaks we're looking for.
Direct peak picking is hard.
So what we're going to do is we're going to turn the problem into one of finding zero crossings.
The top waveform is the low-pass filtered speech and the bottom waveform is just its derivative: I have differentiated the waveform.
What does that mean? That means just taking its local slope.
So at, for example, this point, the local slope is positive; at this point, the local slope is negative; and importantly on the peaks the local slope is about zero, although that's true about these peaks as well.
So the waveform on the bottom is the differentiation or derivative.
We might just write that as "delta".
We're now going to find these points because they will correspond to the zero crossings in the bottom waveform.
Now, to find just the top peaks, we're looking for where the slope changes from positive to negative.
So we're looking for crossings from positive to negative.
For example, this peak can easily be identified by where the signal crosses the boundary here.
So what we've done so far is low-pass filtered the speech waveform and then taken the derivative (or the differential).
That could be done as simply as taking the difference between consecutive samples in the waveform.
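Here's a minimal sketch of that step in Python: take the sample-to-sample difference as the local slope, then look for places where it goes from positive to negative.

```python
import numpy as np

def positive_to_negative_crossings(filtered):
    # Local slope: the difference between consecutive samples
    # stands in for the derivative.
    slope = np.diff(filtered)
    # A main peak in the filtered signal is where the slope crosses zero
    # going downwards: positive on one sample, zero or negative on the next.
    crossings = np.where((slope[:-1] > 0) & (slope[1:] <= 0))[0] + 1
    return crossings  # candidate pitch mark locations, in samples
```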
The result of this very simple algorithm is this.
We now have the original speech waveform, which has been low-pass filtered and then differentiated.
We found all the zero crossings that were going from positive to negative, and that's what these red lines are indicating.
And this gives us a consistent mark within each pitch period.
Now these marks aren't exactly aligned with the main peaks.
We would need some sort of post-processing to make that alignment, but we've done pretty well with such a simple algorithm.
However, we see some problems.
For example, we're getting some spurious pitch marks where there's no voicing just because this unvoiced speech happens to have some energy around F0 by chance, and that happened to lead to some zero crossings.
So let's just summarise that very simple algorithm.
It's typical of many signal processing algorithms in that it has three steps.
The first step is to pre-process the signal: make the signal simpler - in this case, by removing everything except F0.
So the pre-processing here was just simply a low-pass filter.
There is then the main part of the algorithm, which is to do peak picking on that simplified signal.
Peak picking was too hard, but we could differentiate and do zero-crossing detection, which is easy.
Then we put some improvements into that algorithm to get rid of some of the spurious zero crossings.
For example, the ones that happened in unvoiced speech.
We might run a smoothing filter across the signal that retains the main shape of the signal but gets rid of those little fluctuations where we got the spurious zero crossings in unvoiced speech.
Then find the crossings and put pitch marks on each of them.
And finally, like almost all signal processing algorithms, not only does it have pre-processing, it also has some post-processing and - in the case of pitch marking - that might be to then align the pitch marks with the main peaks in each waveform.
So in this diagram, that might mean applying some offset or time shift correction to try and line them up with the main peak.
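Putting the three steps together, here's a minimal end-to-end sketch in Python. The cutoff, the smoothing window, and the snap-to-nearest-peak post-processing are all illustrative assumptions, and this skips the voiced/unvoiced decision that a practical pitch marker would also need.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def pitch_mark(speech, sample_rate, cutoff_hz=150.0, snap_ms=5.0):
    # 1. Pre-processing: keep only the energy around the fundamental.
    b, a = butter(4, cutoff_hz, btype="low", fs=sample_rate)
    simplified = filtfilt(b, a, speech)

    # Improvement: light smoothing to suppress the small fluctuations in
    # unvoiced regions that would otherwise cause spurious zero crossings.
    win = np.hanning(9)
    simplified = np.convolve(simplified, win / win.sum(), mode="same")

    # 2. Main step: differentiate, then find positive-to-negative crossings.
    slope = np.diff(simplified)
    marks = np.where((slope[:-1] > 0) & (slope[1:] <= 0))[0] + 1

    # 3. Post-processing: shift each mark onto the nearest main peak of the
    # original waveform, searching within a small window either side.
    half = int(snap_ms / 1000.0 * sample_rate)
    snapped = []
    for m in marks:
        lo, hi = max(0, m - half), min(len(speech), m + half)
        snapped.append(lo + int(np.argmax(speech[lo:hi])))
    return np.array(snapped)  # pitch mark locations, in samples
```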
So that's epoch detection or "pitch marking".
Those pitch marks are just timestamps.
They'd be stored in the same way we might store a label.
They're just a list of the times at which we think there's a pitch period in voiced speech.
They will be used, for example, in pitch-synchronous overlap-and-add signal processing.
For example, something as simple as concatenating two candidates in unit selection.
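Because the pitch marks are just timestamps, storing them can be as simple as writing one time per line, much like a label file. The plain-text format below is only an illustration.

```python
def save_pitch_marks(marks_in_samples, sample_rate, path):
    # Convert each mark from a sample index to a time in seconds
    # and write one timestamp per line.
    with open(path, "w") as f:
        for m in marks_in_samples:
            f.write(f"{m / sample_rate:.6f}\n")
```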
