F0 estimation (part 2)

Some F0 estimation algorithms apply pre-processing to the speech waveform, and all use post-processing to select from multiple candidate values for F0. Most algorithms have several parameters that need to be carefully chosen.

Let's consider whether any pre-processing would actually help.
It's tempting to do what we did in epoch detection, and that is to low-pass filter the speech waveform to throw away everything except the fundamental.
That's what's happened here: we've low-pass filtered the waveform.
But think about what cross-correlation is really doing.
It's looking for similarity between one pitch period and the next, and it measures that similarity sample by sample.
It doesn't matter whether the waveform is complicated or simple: we can measure self-similarity in both cases.
In fact, low-pass filtering throws away information.
Think about what it does in the frequency domain.
Let's just zoom in to make things clearer.
We're trying to estimate the frequency of the fundamental, which is this thing here.
One way to do that would be to throw away everything except the fundamental, and then measure it.
However, there's additional evidence about its value from all these harmonics, which are at multiples of the fundamental, so keeping all that information in the waveform might actually aid us in recovering the self-similarity, or cross-correlation, between the waveform and the lagged version of that waveform.
So it's not at all clear that low-pass filtering would be a good idea.
Some pitch estimation algorithms do it, and some don't.
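Just to make that concrete, here is a minimal Python sketch of that optional pre-processing step; the function name, cutoff frequency, and filter order are my own assumptions, with the cutoff chosen to sit above any plausible F0 for the speaker.

```python
from scipy.signal import butter, filtfilt

def lowpass(x, fs, cutoff_hz=900.0, order=5):
    """Optional pre-processing: low-pass filter the waveform so that
    little beyond the fundamental and first few harmonics remains.
    The 900 Hz cutoff is an assumption; it only needs to sit above
    the highest F0 we expect for this speaker."""
    b, a = butter(order, cutoff_hz / (fs / 2.0), btype="low")
    return filtfilt(b, a, x)  # zero-phase, so pitch periods are not shifted
```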
Another temptation is to try to get a signal that's as close as possible to the source: to what was coming out of the vocal folds.
One way to do that is called inverse filtering.
So let's try that on our signal.
In the frequency domain, inverse filtering would try to make this spectral envelope, which currently looks like this, flat.
We try to flatten the envelope by putting the signal through a filter which has the inverse response of the spectral envelope.
That boosts all the low-energy parts and suppresses the high-energy parts, giving us a signal which is a bit more like a pulse train.
But this inverse filtering has made assumptions (specifically, that the vocal tract is a linear predictive filter), and doing it might introduce distortions into the signal.
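Here is one hedged sketch of that idea, using linear prediction: librosa fits the spectral-envelope filter, and applying its inverse (an all-zero filter) to the waveform leaves a residual that is closer to the source. The LPC order is an assumption, not a value from any particular algorithm.

```python
import librosa
from scipy.signal import lfilter

def inverse_filter(x, order=16):
    """Flatten the spectral envelope by LPC inverse filtering.
    librosa.lpc returns [1, a1, ..., ap]; filtering the waveform with
    those coefficients removes the envelope, leaving a residual that
    looks more like a pulse train.  The order is an assumption
    (roughly the sample rate in kHz plus a few is a common rule of thumb)."""
    a = librosa.lpc(x, order=order)
    return lfilter(a, [1.0], x)
```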
So again, some pitch estimation algorithms do it, and some don't.
Let's now take a look at perhaps the most famous F0 estimation method of them all.
It's called RAPT: "a robust algorithm for pitch tracking".
Note that "pitch tracking" is the most common term used in the literature.
RAPT is cross-correlation dressed up with some pre- and post-processing.
Let's just understand what we're looking at in this picture.
This is not a spectrogram.
The axis along here is time, but the vertical axis is lag.
So this is a correlogram.
If we take our cross-correlation plot from before, it's now along the vertical axis: where a spectrogram would have a spectrum along the vertical axis, here we have a cross-correlation function.
We're using black pixels to denote peaks.
You can see that peaks correspond to dark areas.
(The lag range here is different between my example and the one from the paper.)
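To make the picture concrete, here is a rough sketch of how such a correlogram could be computed: one normalised cross-correlation function per analysis frame, stacked into a two-dimensional array. The frame length, hop, and F0 search range are assumptions, not the values used in the paper.

```python
import numpy as np

def correlogram(x, fs, frame_len=0.04, hop=0.01, min_f0=60.0, max_f0=400.0):
    """One normalised cross-correlation function per frame.
    Plotting the returned array as an image gives time along the
    horizontal axis and lag along the vertical axis, as in the figure."""
    N, H = int(frame_len * fs), int(hop * fs)
    min_lag, max_lag = int(fs / max_f0), int(fs / min_f0)
    columns = []
    for start in range(0, len(x) - N - max_lag, H):
        frame = x[start:start + N]
        ncc = []
        for lag in range(min_lag, max_lag + 1):
            shifted = x[start + lag:start + lag + N]
            denom = np.sqrt(np.dot(frame, frame) * np.dot(shifted, shifted))
            ncc.append(np.dot(frame, shifted) / denom if denom > 0 else 0.0)
        columns.append(ncc)
    return np.array(columns).T  # shape: (lags, frames)
```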
This is a great visual representation of the problem we're faced with in peak picking: we have to find this peak, and we have to track it over time.
We have to know when it stops, when the speech stops being voiced, see when it starts again, and then track it through time again.
That track is the fundamental period, from which, of course, we can recover the fundamental frequency.
We can see why this is hard: if we just do some naive picking of peaks, we'll track perfectly fine here, but then we'll switch up to this peak here because it looks stronger, and maybe switch down to this peak here.
We might make errors here as well.
So we're going to get errors in F0 tracking.
This would be an octave error: jumping between the different peaks.
We might also accidentally pick up some of the in-between peaks, and then we've got non-octave errors as well.
To recover from these errors in peak picking, it's normal to do some post-processing.
What we will use this diagram for is to obtain candidate values of F0, not the final value.
So we would say there are some candidate values here.
There are some candidate values here, and here: lots of candidate values.
All of these are possible evidence of F0, and then we'll try to join up the dots.
The standard way of doing that is some dynamic programming.
This diagram is from a different method, and the axis is a bit different here: the vertical axis has been transformed from lag into frequency, which is just one over lag.
These dots are candidates: they're peaks from the cross-correlation function.
We can use dynamic programming to join the dots.
The dynamic programming will have a cost to do with how good each candidate is (how high the peak was) and a cost to do with how likely it is to join to the next dot: for example, how much F0 moves.
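Here is a hedged sketch of that dynamic programming, assuming we already have, for every frame, a list of (F0, peak strength) candidates from the cross-correlation function, and assuming every frame is voiced. The local cost rewards strong peaks; the transition cost penalises large moves in log F0, which makes octave jumps expensive. The cost weighting is my own assumption.

```python
import numpy as np

def track_f0(candidates, jump_weight=5.0):
    """Join up the dots with dynamic programming (a Viterbi search).
    candidates: one list per frame of (f0_hz, strength) pairs, where
    strength is the cross-correlation peak height in [0, 1].
    Assumes every frame has at least one candidate (i.e. is voiced)."""
    # Forward pass: cheapest cumulative cost to reach each candidate.
    costs = [np.array([1.0 - s for _, s in candidates[0]])]
    backptrs = []
    for t in range(1, len(candidates)):
        cost_t, bp_t = [], []
        for f0, strength in candidates[t]:
            # Transition cost: how far F0 moves, measured in octaves.
            trans = np.array([abs(np.log2(f0 / prev_f0))
                              for prev_f0, _ in candidates[t - 1]])
            total = costs[-1] + jump_weight * trans
            bp_t.append(int(np.argmin(total)))
            cost_t.append(total[bp_t[-1]] + (1.0 - strength))
        costs.append(np.array(cost_t))
        backptrs.append(bp_t)
    # Backtrace from the cheapest final candidate.
    j = int(np.argmin(costs[-1]))
    path = [j]
    for bp_t in reversed(backptrs):
        j = bp_t[j]
        path.append(j)
    path.reverse()
    return [candidates[t][j][0] for t, j in enumerate(path)]
```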
So pretty much all signal processing algorithms like this do some pre-processing, have a core which is doing all the work, and do some post-processing.
In all three of those areas (the pre-processing, the cross-correlation itself, and the post-processing) there are lots of parameters we need to choose.
We've already seen some: for example, the window size.
Here are the tunable parameters in the RAPT algorithm.
There are quite a few of them.
We need to set them through experimentation, through intuition, and through our knowledge of speech: for example, the range of minimum and maximum values of F0.
To get the very best performance from algorithms like this, we would want to tune some or all of these parameters to the individual speaker we're dealing with.
The F0 range is the most obvious one: making that as narrow as possible will produce as few errors as possible.
Whilst autocorrelation, or its modified form, cross-correlation, is by far the most common way of extracting F0, there are alternatives out there.
Here's one that you should already understand: it uses the cepstrum.
Look at this speech spectrum, of which the diagram on top is a rather abstract picture: that's frequency, and that's magnitude.
It's just a spectrum, rather idealised.
These are the harmonics; this is the spectral envelope.
When we fit the cepstrum to this spectrum, we are extracting shape parameters, and the lower-order ones capture the gross shapes, like the spectral envelope.
But eventually, one of the higher-order cepstral coefficients will fit nicely onto these harmonics.
It will be the component at that particular quefrency, and we'll get a large value for that cepstral coefficient.
If we plot the cepstral coefficients along this axis and their magnitude along this axis, we'll find that eventually one has a large value, and that quefrency, which is a time in seconds, is again the fundamental period.
That's a less common method, but it will work.
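Here is a minimal sketch of that cepstral method for a single voiced frame; the frame is assumed to span several pitch periods, and the F0 search range is an assumption.

```python
import numpy as np

def cepstral_f0(frame, fs, min_f0=60.0, max_f0=400.0):
    """The harmonic ripple in the log magnitude spectrum shows up as a
    large cepstral coefficient at a quefrency (a time, in samples)
    equal to the fundamental period."""
    windowed = frame * np.hamming(len(frame))
    log_mag = np.log(np.abs(np.fft.rfft(windowed)) + 1e-10)
    cepstrum = np.fft.irfft(log_mag)
    # Search only quefrencies corresponding to plausible F0 values.
    lo, hi = int(fs / max_f0), int(fs / min_f0)
    peak = lo + int(np.argmax(cepstrum[lo:hi]))
    return fs / peak  # period in samples -> F0 in Hz
```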
There are various other methods out there, which we're not going to go into in great detail.
One would be the following.
We construct a filter called a comb filter, which has notches at multiples of a possible value of F0.
So we hypothesise a value of F0, put the speech signal through this filter, and see how much energy it removed.
Then we vary the F0 value of the filter, moving it up and down, and the value that removes the maximum amount of energy is our estimate of the F0 of the signal.
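A quick sketch of that search, under the assumption of the simplest possible comb, y[n] = x[n] - x[n - T], whose notches fall at multiples of fs / T; the hypothesised period that removes the most energy wins.

```python
import numpy as np

def comb_filter_f0(frame, fs, min_f0=60.0, max_f0=400.0):
    """Hypothesise each candidate period T, apply the comb filter
    y[n] = x[n] - x[n - T] (notches at multiples of fs / T), and
    keep the T that removes the most energy."""
    best_T, best_energy = None, np.inf
    for T in range(int(fs / max_f0), int(fs / min_f0) + 1):
        residual = frame[T:] - frame[:-T]
        energy = np.mean(residual ** 2)  # energy left after filtering
        if energy < best_energy:
            best_T, best_energy = T, energy
    return fs / best_T
```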
Of course, we can always throw machine learning at the problem: we could use something like a neural network and ask it to learn how to extract F0, obviously from some labelled data.
So we would need some data to do that: it would need ground truth data.
Note that the RAPT algorithm and the like don't need any ground truth; their parameters are basically set by hand.
And this machine learning method is not magic, because it still needs some feature engineering: it still has this autocorrelation idea at its core, and it still needs the dynamic programming post-processing.
It isn't really very different from the other algorithms.
Let's finish off by considering how you would evaluate an F0 estimation algorithm.
To do that, you would need some ground truth with which to compare, and there are various ways of getting that.
One is to do what we said we don't like to do when we're actually recording data for speech synthesis, but would be willing to do to get some ground truth data for algorithm development.
That's to use a device called a laryngograph to physically measure vocal fold activity, and from that get a pretty reliable estimate of F0 to compare against our algorithm's estimate from the waveform.
We might even hand-correct the contours that we extract with the laryngograph, to get true values for comparison.
This, of course, is a bit tedious and a bit expensive, and so we would try not to do it ourselves.
We'd try to find somebody else's data, and there are various public databases available, such as this one here, if you wanted to compare the performance of your F0 estimation algorithm against somebody else's.
F0 estimation algorithms can make various sorts of error.
A big error would be failing to detect voicing correctly: that would be a voicing status error, the voiced-versus-unvoiced error.
We could express that as a percentage of time where we got it right or wrong.
The second sort of error would be to get the value of F0 wrong.
Often, when we evaluate algorithms for F0 estimation, we'll decompose those errors into gross errors, such as octave errors, where we're off by a factor of two, and fine-grained errors, where we're off by a small amount.
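As a concrete illustration, here is a sketch of those measures, assuming frame-aligned reference and estimated tracks in which a value of zero (or less) means unvoiced; the 20% threshold separating gross from fine errors is a common convention, but an assumption here.

```python
import numpy as np

def f0_errors(ref, est, gross_threshold=0.2):
    """ref, est: frame-aligned F0 tracks in Hz, with <= 0 meaning unvoiced.
    Returns voicing decision error (%), gross pitch error (%), and the
    mean fine pitch error (Hz) over frames that are voiced in both
    tracks and not gross errors."""
    ref, est = np.asarray(ref, float), np.asarray(est, float)
    ref_v, est_v = ref > 0, est > 0
    voicing_error = 100.0 * np.mean(ref_v != est_v)
    both = ref_v & est_v
    rel = np.abs(est[both] - ref[both]) / ref[both]
    gross = rel > gross_threshold            # e.g. octave errors
    gross_error = 100.0 * np.mean(gross) if both.any() else 0.0
    fine = np.abs(est[both][~gross] - ref[both][~gross])
    fine_error = float(np.mean(fine)) if fine.size else 0.0
    return voicing_error, gross_error, fine_error
```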
And finally, let's ask ourselves whether F0 estimation is equally difficult for all different types of speech, and perhaps even for all speakers.
F0 estimation algorithms normally assume that speech is perfectly periodic.
That's just an assumption we've made in all of that signal processing.
If the speech is not perfectly periodic, they will perform poorly: for example, on creaky voice.
We don't expect to get very good estimates of F0 for creaky voice; it's not really well defined what the value of F0 is for irregular vibrations of the vocal folds.
Epoch detection algorithms, which are looking for moments in time, should be able to detect the individual epochs in creaky voice, but we don't expect them to perform as well as they do on modal (regular) voice.
Overall, then, as we move forward into modelling these signals, and eventually on to statistical parametric speech synthesis, we can predict that some voice qualities are going to be harder to vocode: creaky voice is one of them, and perhaps breathy voice is another.
If it's hard to vocode them, it will therefore be hard to do statistical parametric synthesis of these different voice qualities.
