Key concepts

Extracting certain features from the speech signal is an essential first step before further processing, such as concatenating candidate unit waveforms, modifying the prosody of a speech signal, or statistical modelling.

In this module, we're going to cover the signal processing techniques that are necessary to get us onto statistical modelling for speech synthesis: something called Statistical Parametric Speech Synthesis.
Specifically, we're going to develop those parameters. We call them speech parameters.
There are two parts to this module.
In the first part, we're going to consider analysis of speech signals.
We're going to generalise what we already know about the source-filter model of speech production and we're going to think strictly in terms of the signal.
So, instead of the source and filter, we will move on to thinking more generally about an excitation signal passing through a spectral envelope.
When we've developed that model - that idea of thinking about the signal, and modelling the signal, and no longer really worrying about the true aspects of speech production - we can then think about getting the speech parameters into a form that's suitable for statistical modelling.
Specifically then, we're going to analyse the following aspects of speech signals.
We're going to think about the difference between epochs (which are moments in time) and fundamental frequency (which is an average rate of vibration of the vocal folds over some small period of time).
We're going to think about the spectral envelope, and that's going to be our generalisation of the filter in the source-filter model.
Now, always remember to put axes on your diagrams! This bottom diagram is a spectrum, so the axes will be frequency (the units might be Hertz) and magnitude.
In the top diagram, the axes are of course time and amplitude, perhaps.
Just check as usual before going on that you have a good knowledge of the following things.
We need to know something about speech signals.
We need to know something about the source-filter model.
And we also need to know about Fourier analysis to get ourselves between the time domain and the frequency domain.
What do you need to know about speech signals?
Well, if we take this spectrum, we need to be able to identify F0 (the fundamental): that's here.
We need to be able to identify the harmonics, and that includes the fundamental: that's the fine structure.
We need to understand where this overall shape comes from, and so far our understanding is that it comes from the vocal tract frequency response.
We might think of it as a set of resonances - a set of formants - or more generally as just the shape of this envelope.
So that's what we need to know about speech signals.
We've got a conceptual model in mind already, called the source-filter model.
That's really to help us understand speech production and in a more abstract way to understand what we see when we look at speech signals, particularly in the frequency domain: in that spectrum.
Our understanding of the source-filter model at the moment is that there is a filter.
The filter could take various forms, such as a linear predictive filter, which is essentially a set of resonances.
We know that if we excite this - if we put some input into it - we will get a speech signal out.
The sort of inputs we have been considering have been, for example, a pulse train which would give us voiced speech out.
Or, random noise which will give us unvoiced speech out.
So that takes care of the source-filter model.
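To make the source-filter idea concrete, here is a minimal sketch (not from the course materials) that excites the same filter with a pulse train to get a voiced-like output, and with white noise to get an unvoiced-like output. The filter coefficients, sample rate and F0 are arbitrary values chosen purely for illustration.

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000                     # sample rate in Hz (illustrative choice)
f0 = 120                       # fundamental frequency of the pulse train, Hz
n = fs // 2                    # half a second of samples

# Source 1: a pulse train at F0 (one unit impulse per pitch period)
period = int(round(fs / f0))
pulses = np.zeros(n)
pulses[::period] = 1.0

# Source 2: white noise
noise = np.random.randn(n)

# "Filter": an arbitrary all-pole (LPC-style) filter standing in for the
# vocal tract frequency response; the coefficients are purely illustrative.
a = [1.0, -1.3, 0.8]
b = [1.0]

voiced = lfilter(b, a, pulses)   # pulse train in  -> voiced-like speech out
unvoiced = lfilter(b, a, noise)  # noise in        -> unvoiced-like speech out
```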
Finally, we just need to remind ourselves what Fourier analysis can achieve.
That can take any signal, such as the time-domain speech waveform, and express it as a sum of basis functions: a sum of sinusoids.
In doing so, we go from the time domain to the frequency domain.
In that frequency domain representation, we almost always just plot the magnitude of those sinusoids. That's what the spectrum is.
But there are, correspondingly, also the phases of those components: that's the phase spectrum.
So, strictly speaking, Fourier analysis gives us a magnitude spectrum, which is what we always want, and a phase spectrum (which we often don't inspect).
To exactly reconstruct a speech signal, we need the correct phase spectrum as well.
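As a quick sketch of that point, here is how one might compute the magnitude and phase spectra of a single frame with NumPy, and confirm that exact reconstruction needs both. The frame here is just random samples standing in for a windowed piece of speech.

```python
import numpy as np

frame = np.random.randn(1024)        # stand-in for a short frame of speech

spectrum = np.fft.rfft(frame)        # complex DFT of a real signal
magnitude = np.abs(spectrum)         # the magnitude spectrum we usually plot
phase = np.angle(spectrum)           # the phase spectrum we usually ignore

# Exact reconstruction of the waveform needs both magnitude and phase:
reconstructed = np.fft.irfft(magnitude * np.exp(1j * phase), n=len(frame))
assert np.allclose(reconstructed, frame)
```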
So, in the first part of the module we'll consider what exactly it is we need to analyse about speech signals.
Where we're going is a decomposition of a speech signal into some separate speech parameters.
For example, we might want to manipulate them and then reconstruct the speech waveform.
Or we might want to take those speech parameters and make a statistical model of them, that can predict them from text, and use that for speech synthesis.
The first thing I'm going to do is drop the idea of a source-filter model mapping onto speech production, because we don't really need the true source and filter.
In other words, we don't really normally need to extract exactly what the vocal folds were doing during vowel production.
We don't really need to know exactly what the filter was like, for example where the tongue was.
So, rather than thinking about the physics of speech production, we're just going to think much more pragmatically about the signal.
Because all we really need to do is to take a signal and measure something about it.
For example, track the value of F0 so we can put it into the join cost function of a unit selection synthesiser.
We might want to decompose the signal into its separate parts so that we can separately modify each of those parts.
For example, the spectral envelope relates to phonetic identity and the fundamental frequency relates to prosody.
Or we might want to do some manipulations without actually decomposing the signal.
In other words, by staying in the time domain.
For example, we might want to do very simple manipulations such as smoothly join two candidate units from the database in unit selection speech synthesis.
So here's a nice spectrum of a voiced sound.
You can identify the fundamental frequency, the harmonics, the formants and the spectral envelope.
We're going to model this signal as we see it in front of us.
We're not going to attempt to recover the true physics of speech production, so we're going to be completely pragmatic.
Our speech parameters are going to be things that relate directly to the signal.
We're not going to worry whether they do or don't map back onto the physical speech production process that made this signal.
So, for example, there is an envelope of this spectrum.
Call it a "spectral envelope".
Just draw it roughly.
Now that clearly must be heavily influenced by the frequency response of the vocal tract.
But we can't say for sure that that's the only thing that affects it.
For example, the vocal fold activity isn't just a flat spectrum.
It's not a nice perfect line spectrum, because the vocal folds don't quite make a pulse train.
So the spectral envelope is also influenced by something about the vocal folds: about the source.
We're not going to try and uncover the true vocal tract frequency response.
We're just going to try and extract this envelope around the harmonics.
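One common way to do that - offered here only as an illustrative sketch, not as the specific method used later in the course - is cepstral smoothing: take the log magnitude spectrum, keep only its slowly-varying part (the low "quefrency" cepstral coefficients), and transform back to get a smooth envelope that passes around the harmonics.

```python
import numpy as np

def cepstral_envelope(frame, n_coeffs=30):
    """Smooth spectral envelope of one frame via cepstral liftering (sketch)."""
    magnitude = np.abs(np.fft.rfft(frame)) + 1e-10   # magnitude spectrum
    log_mag = np.log(magnitude)
    cepstrum = np.fft.irfft(log_mag)                 # real cepstrum
    # Keep only the low-quefrency coefficients (and their symmetric copies),
    # which describe the slowly-varying envelope rather than the harmonics.
    liftered = np.zeros_like(cepstrum)
    liftered[:n_coeffs] = cepstrum[:n_coeffs]
    liftered[-(n_coeffs - 1):] = cepstrum[-(n_coeffs - 1):]
    return np.exp(np.fft.rfft(liftered).real)        # smooth envelope
```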
Before we move on to the details of each of these speech parameters that we would like to analyse or extract from speech signals, let's just clear up one potential for misunderstanding.
That's the difference between epochs and fundamental frequency.
In a moment we'll look at each of them separately.
It's very important to make clear that these are two different things.
Obviously, they're related because they come from the same physical part of speech production.
But we extract them differently, and we use them for different purposes.
Epoch detection is perhaps more commonly known as pitch marking, but I'm going to say "epoch detection" so we don't confuse ourselves with terminology.
It's sometimes also called glottal closure instant detection or GCI detection.
This is needed for signal processing algorithms.
Most obviously, if we're going to do pitch-synchronous signal processing, we need to know where the pitch periods are.
For example, in TD-PSOLA, we need to find the pitch periods, and epoch detection therefore is a necessary first step in TD-PSOLA.
More simply, even if we're just overlap-and-adding units together, we might do that pitch-synchronously, so that's also a kind of TD-PSOLA but without modifying duration or F0.
Again, we need to know these pitch marks or epochs for that.
A few vocoders might need pitch marks because they operate pitch-synchronously.
So that's epoch detection.
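To show why epochs (rather than F0 alone) are needed for pitch-synchronous processing, here is a hedged sketch of the overlap-add step: cut out two-period, Hann-windowed frames centred on each epoch and add them back at the same positions. With no modification this approximately reconstructs the original; TD-PSOLA then changes duration or F0 by repeating or deleting frames, or by shifting the target epoch positions. The function name and interface are made up for illustration.

```python
import numpy as np

def pitch_synchronous_ola(signal, epochs):
    """Identity resynthesis from epoch positions (sample indices) - a sketch."""
    out = np.zeros(len(signal))
    for i in range(1, len(epochs) - 1):
        left, right = epochs[i - 1], epochs[i + 1]
        frame = signal[left:right]            # roughly two pitch periods
        window = np.hanning(len(frame))       # taper so the frames blend
        out[left:right] += frame * window     # overlap-add at the original epoch
    return out
```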
F0 estimation is perhaps more often called "pitch tracking", and again, I'm going to try and consistently say "F0 estimation" to avoid confusion between "pitch marking" and "pitch tracking", which sound a bit similar.
F0 estimation involves finding the rate of vibration of the vocal folds.
It's going to be a local measure because the rate of vibration changes over time.
It's not going to be as local as epochs.
In other words, we're going to be able to estimate it over a short window of time.
F0 is needed as a component in the join cost.
All unit selection systems are going to use that.
We might also use it in the target cost.
If we've got an ASF-style target cost function, we will need to know the true F0 of the candidates so we can compare it to the predicted F0 of the targets, and include that difference as a component of the ASF target cost.
Almost all vocoders need F0 as one of their speech parameters.
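As a tiny illustration of the join cost point, an F0 component might simply compare the F0 at the end of one candidate with the F0 at the start of the next, typically on a log scale; the exact form and weighting are system-specific, so treat this as an assumption rather than a recipe.

```python
import math

def f0_join_cost(f0_left_end, f0_right_start, weight=1.0):
    # Both values assumed voiced (> 0); unvoiced frames need special handling.
    return weight * abs(math.log(f0_left_end) - math.log(f0_right_start))
```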
So just to make that completely clear, because it's a very common confusion, epoch detection is about finding one point in each pitch period, for example the biggest peak.
That looks trivial on this waveform.
But in general, it's not trivial because waveforms don't always look as nice as this example.
If we could do this perfectly, then this would be a great way to estimate F0, because we could just take some region of time - some window - and count how many epochs per second there were.
And that would be the value of F0.
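In other words, if epoch detection were perfect, F0 would follow directly from the epochs - for example, the reciprocal of the mean interval between consecutive epochs inside the window. A minimal sketch, with epoch times assumed to be in seconds:

```python
import numpy as np

def f0_from_epochs(epoch_times, window_start, window_end):
    in_window = [t for t in epoch_times if window_start <= t <= window_end]
    if len(in_window) < 2:
        return 0.0                      # treat as unvoiced / undefined
    intervals = np.diff(in_window)      # durations of the pitch periods
    return 1.0 / np.mean(intervals)     # average rate of vibration, in Hz
```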
Now epoch detection is a bit error prone.
We might miss the occasional period.
So when we estimate F0, we don't tend to always do it directly from the epochs.
So, separately, from epoch detection, F0 estimation is the process of finding, for some local region (or window) of the speech signal, the average rate of vibration of the vocal folds.
Of course, that window needs to be small enough so we think that that rate is constant over the window.
Hopefully, already, the intuition should be obvious: that, because F0 estimation can consider multiple periods of the signal, we should be able to do that more robustly than finding any individual epoch.
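One standard way to exploit that robustness - presented here as a bare-bones sketch rather than the particular estimator used in any specific tool - is autocorrelation: the lag at which the windowed signal best matches a shifted copy of itself corresponds to one pitch period, and so gives F0 for that window.

```python
import numpy as np

def autocorrelation_f0(frame, fs, f0_min=50.0, f0_max=400.0):
    """Estimate F0 (Hz) of one voiced frame by peak-picking the autocorrelation."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]   # lags >= 0
    lag_min = int(fs / f0_max)                     # shortest plausible period
    lag_max = min(int(fs / f0_min), len(ac) - 1)   # longest plausible period
    best_lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    return fs / best_lag
```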
