Wrap-up

Brief coverage of the end-to-end process of building and using an HMM-based speech synthesiser, including some mention of speech parameter generation and duration modelling.

Let's summarise this whole process: the practical process of hidden Markov model plus regression tree based speech synthesis.
We start the text-to-speech pipeline with linguistic processing that just turns text into linguistic features in the usual way.
So you get the same front end as you always have.
We then flatten the structure, so we don't have trees and all of that exciting linguistic stuff.
We just attach context at the phone level.
So we flatten the linguistic structure.
Now, for this flattened structure, we would like to create one context-dependent HMM for every unique combination of linguistic features.
But we realised, actually, that's impossible.
So we need a solution to that problem.
The solution happens while we're training the HMMs.
Of course, we need labelled speech data.
This is just standard supervised learning.
For all of those models that don't have any training data, or don't have enough training data, there is a single elegant solution, and we could describe that as, if you like, re-parameterising the models using a regression tree.
In other words, the models don't have as many free parameters as it first looks.
If we took the total number of phones-in-context times five states, we'd get a very large number.
That's not how many parameters the model has: it has far, far fewer than that, because it's been re-parameterised by the regression tree.
In other words, we keep arriving at the same leaf of the tree over and over again, for many different states of different models: they are sharing their parameters.
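To get a feel for the numbers, here's a small illustrative calculation in Python; the phone set size and leaf count are invented values, and real systems condition on far more context than just the neighbouring phones, so the true unshared count is even larger:

```python
# Illustrative arithmetic only: the phone set size and leaf count are
# invented, and real systems use far richer contexts than +/-2 phones.
n_phones = 45                    # size of a typical phone set (assumed)
n_states = 5                     # emitting states per model
n_models = n_phones ** 5         # one model per phone in +/-2 phone context

print(f"states if nothing is shared: {n_models * n_states:,}")  # ~923 million

# With the regression tree, every one of those states is re-parameterised
# by one of a few thousand leaves, and that is all we actually store.
n_leaves = 5000                  # plausible order of magnitude (assumed)
print(f"Gaussians actually stored:  {n_leaves:,}")
```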
To synthesise from the hidden Markov models, we take the front end, which gives us this flat, context-dependent phone sequence.
Each of those is just the name of a model.
We pull that model out of our big bag of models, but the models are sort of empty: the states don't have parameters.
So we go to the regression tree and we query it, using the name of the model, to fill in the parameters of that model.
We then just concatenate the sequence of context-dependent models for the sentence we're trying to say.
We somehow generate speech parameters from that; we don't know how to do that yet, but I'm about to say something about it in a moment.
Then we'll use a vocoder to convert the speech parameters to a waveform, and we already know how to do that.
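To make the querying step concrete, here's a minimal sketch. This is not any real toolkit's API: the question, the leaf names, and the parameter values are all invented for illustration.

```python
# A minimal sketch of "empty" models having their state parameters filled
# in by querying a regression tree with the model's name.

def tree_query(model_name):
    # A real system grows a tree per state position and asks many yes/no
    # questions about the context encoded in the name; this toy tree asks one.
    return "leaf_1" if "+ae" in model_name else "leaf_2"

leaf_params = {                   # each leaf holds a shared (mean, variance)
    "leaf_1": (135.0, 1.2),
    "leaf_2": (110.0, 0.8),
}

# The front end gives us a flat sequence of context-dependent model names.
for name in ["sil-k+ae", "k-ae+t", "ae-t+sil"]:
    params = [leaf_params[tree_query(name)] for _ in range(5)]  # 5 states
    print(name, "->", params[0])
```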
As I warned you near the start of this module, we're not going to go into great technical detail about the generation algorithms for regression tree plus HMM synthesisers.
There's quite a lot of literature about it, and it looks like a prominent and difficult problem.
However, because HMM synthesis is probably going to be completely superseded by neural net synthesis, we don't need a deep understanding of this generation algorithm.
It's really just smoothing: it's just fancy-looking smoothing.
So what we'll do is state how we would generate from a hidden Markov model, we'll spot a problem with that, and I'll give you, just graphically, the solution to that problem.
Well, since hidden Markov models are generative models, generating from them should be straightforward.
And yes, of course it is.
We could just run it as a generative model: for example, we can take a random walk through it.
Let's just assume that it's a hidden Markov model with transitions between states, and self-transitions that give us a model of the duration of staying in each state.
Once we're in a state, we need to generate some output: we need to emit an observation.
Well, we should follow the most basic principle there is in machine learning, the maximum likelihood principle, and just do the most likely thing.
I mean, what else would you do? You wouldn't do the least likely thing!
So we generate the most likely output, and the most likely value that you can generate from a Gaussian distribution is, of course, the mean, because that's where the probability density is highest.
So we'll just generate the state mean.
If we stay in one hidden Markov model state for more than one frame, we'll just emit the exact same value each frame: because it's Markov, we don't remember what we did the last frame, and because we're following the maximum likelihood principle, we'll do the same thing every time.
So, for as long as we stay in one state, we'll generate constant output.
That doesn't sound quite right, but that's what we're doing so far.
In this Markov model, the self-transition is the model of duration.
For example, we might write some number on it, and that's the probability of staying in the state.
We'll get some rather crude, exponentially-decaying duration distribution from that.
That's not good enough for synthesis (it's good enough for recognition), because in generation we need to generate a plausible duration, and this model actually gives the highest probability to the shortest possible duration of one frame.
That's not good.
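To make that concrete, here's a minimal Python sketch; the transition probability is an invented value:

```python
# With self-transition probability a, the chance of staying in a state for
# exactly d frames is geometric: P(d) = a**(d-1) * (1-a). Its mode is d = 1.
a = 0.9   # illustrative value

for d in (1, 2, 3, 10, 50):
    print(f"P(duration = {d:2d} frames) = {a**(d-1) * (1-a):.4f}")
# The probabilities only decay as d grows: one frame is always the most
# likely duration, whatever value we write on the self-transition.
```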
So we actually need a proper duration model.
A hidden semi-Markov model looks like this, and it doesn't have the self-transition.
It has some other model of duration that says how long we should stay in this state; in other words, how many frames we should emit from this state.
That's an explicit duration model, not just the rather crude self-transition.
So the model is actually, technically, no longer a hidden Markov model.
It's Markov from state to state, but within a state there's an explicit duration model, so you'll see in the literature that these are called hidden semi-Markov models: "semi", as in half-Markov.
Where is that model of duration coming from? Well, we'll use some external model.
Basically, what do we know? We know the same stuff as for predicting the other speech parameters: the name of the model and which state we're in.
So we have a bunch of predictors: the linguistic context, in other words the name of the model and where we are in that model.
And we have a thing we want to predict: the predictee is just the duration, stated in frames, in these acoustic clock units.
We just need to regress from one to the other.
It's just another regression problem, so let's just use another regression tree.
Straightforward.
It's going to predict duration at the state level.
States are sub-phone units, so we predict a duration per state and, by adding the state durations together, we get the duration of the phone itself.
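As a concrete sketch of that, here's some toy Python; the per-state durations that a real regression tree would predict are invented here:

```python
# A sketch of state-level duration prediction.

def predict_state_duration(model_name, state_index):
    # Predictors: linguistic context (the model's name) and the state index.
    # Predictee: a duration in frames. A toy lookup stands in for the tree.
    return [3, 5, 9, 5, 2][state_index]

name = "k-ae+t"
state_durations = [predict_state_duration(name, s) for s in range(5)]
print(state_durations, "-> phone duration =", sum(state_durations), "frames")
```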
So this description of duration modelling was very brief.
And that's also deliberate, because so far we're using a regression tree to do it, which is not a great model.
So we can talk a little bit more about duration modelling when we go on to more powerful things than regression trees.
And that's later in the neural net speech synthesis part.
But we'll take the regression tree for duration, as is.
It will be okay, but we can beat it later.
Now let's think about this problem of generating constant output for as long as we stay in an individual state.
In other words, we could say the output is piecewise constant.
Let's draw a picture of that, and see why that's really bad.
Here's a hidden Markov model; back to the hidden Markov model.
So I've got transitions here.
This one hasn't got the explicit duration model, but it doesn't matter: the principle is exactly the same, and I'm going to generate from it.
And I'm just going to do this for a one-dimensional feature, so these really are just univariate Gaussians now.
I'm going to plot against time, which is going to be in frames, and I'm going to generate some speech parameter.
Pick any speech parameter you like; I don't know what your favourite speech parameter is; I will pick F0.
So this is going to be F0, let's say in Hertz.
We're going to generate F0 from this model as we move through it from left to right.
What happens when we do that? Well, we'll just generate a constant value for as long as we stay in each state, and the trajectory is just those values joined up.
Just to make clear how that works: the Gaussian from each state is in the same dimension as the speech parameter.
So that's the Gaussian over F0, and this was the mean of that parameter, and we just generate the mean.
We join those up and we get this rather chunky-looking thing, and I've never seen an F0 contour look like that.
If we played that back, it's going to sound pretty robotic.
That can't be right.
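Here's a minimal sketch of exactly that behaviour, with invented state means and durations:

```python
# Maximum-likelihood generation: emit each state's mean for every frame of
# that state's duration. The means and durations here are invented.
state_means = [120.0, 135.0, 128.0, 110.0]   # F0 in Hz, one mean per state
state_durations = [5, 8, 6, 4]               # frames per state

trajectory = []
for mean, duration in zip(state_means, state_durations):
    trajectory.extend([mean] * duration)     # constant within each state

print(trajectory)   # a chunky, piecewise-constant contour: steps, not a curve
```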
What we need is to generate something smooth.
We need to join up those means, but we don't want to be completely constant within each state.
Essentially, we want a smooth version of that trajectory; maybe it looks like this.
Now, in the literature on hidden semi-Markov model based speech synthesis (which is often still written as hidden Markov model based speech synthesis), we will find algorithms to do this in a principled and correct probabilistic way, but we can really just understand it as some simple smoothing; that's good enough for our understanding at this point.
The correct algorithm for doing this has a name: it's called Maximum Likelihood Parameter Generation (MLPG), and it pays attention not just to the mean but also to the standard deviation of the Gaussians and, really importantly, to the slope: how fast a speech parameter can change.
That would be really important for F0, because we can only increase or decrease F0 at a certain rate; there are physical constraints on what our muscles can do, for example.
So MLPG pays attention to the statistics not only of the speech parameter but also of its rate of change over time, in other words of its deltas, and of course we can also put in delta-deltas.
So it's a smoothing that's essentially learned from data: it smooths in the way that's most natural, in the way that the data itself is smooth.
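As a concrete illustration of deltas, here's one common way of defining them, as a central difference over neighbouring frames; the specific formula is an assumption for illustration, not something given in this lecture:

```python
# One common definition of delta features: a central difference over the
# neighbouring frames (with the ends clamped). MLPG models the statistics
# of these alongside the static parameter itself.
def deltas(x):
    return [(x[min(t + 1, len(x) - 1)] - x[max(t - 1, 0)]) / 2
            for t in range(len(x))]

f0 = [120.0, 120.0, 120.0, 135.0, 135.0]
print(deltas(f0))   # non-zero exactly around the step between states
```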
I'm not going to cover MLPG here; it's available in the papers, if you wish to understand it.
It's been shown that rather simple, ad-hoc smoothing actually achieves the same sort of outcome, in terms of naturalness, as the correct MLPG.
So it's really OK to think of MLPG as smoothing.
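In that spirit, here's a crude stand-in for MLPG: a simple moving average over the piecewise-constant trajectory. The window length is an arbitrary choice, not anything from the lecture.

```python
# Ad-hoc smoothing of the chunky trajectory generated above.
def moving_average(x, window=5):
    half = window // 2
    out = []
    for t in range(len(x)):
        segment = x[max(t - half, 0): t + half + 1]
        out.append(sum(segment) / len(segment))
    return out

chunky = [120.0] * 5 + [135.0] * 8 + [110.0] * 4
print([round(v, 1) for v in moving_average(chunky)])  # steps become slopes
```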
That's the bare bones, then, really the bare bones, of HMM-based statistical parametric speech synthesis.
It's just a first attempt, and it's been deliberately kept slightly abstract and high-level, for two reasons.
One is that the technical details are better gained from the readings.
But the main one is that this method of synthesis is slowly becoming obsolete.
Well, it's not gone yet, but it is slowly being replaced by neural net based speech synthesis.
Let's just understand what we've achieved so far.
That was our first attempt at statistical parametric speech synthesis, and we did it by choosing models that we know well.
That seems a little naive: we've decided only to choose models we've already used in the past.
But it's actually perfectly sensible, and the reason is that we've got good algorithms for those models.
We understand very, very well how to use hidden Markov models.
This idea of clustering them is well established in automatic speech recognition.
We've re-described it here as regression, but the ways of doing it are very well understood and very well tested and, for example, there are also good software implementations of them.
So we've got algorithms for building the trees, and we've got really good algorithms for training the hidden Markov models themselves: the Baum-Welch algorithm, which still applies when the model is being re-parameterised in this way.
We're not going to get into how that works; we're just going to state that a model that's parameterised using a regression tree is no more difficult to train than one that has its own parameters inside each model state.
The algorithm doesn't change.
We've got rather poor models, but really good algorithms.
That's a common situation in machine learning.
However, regression trees are just not good enough.
Although we can train them, they're really fast at run time, and we can inspect them if we really want to (they're somehow human-friendly), they are the key weakness of this method.
We really must replace them, and we're going to replace them with something a lot more powerful.
However, there is something really good about this HMM plus regression tree framework, and that's the Gaussians.
We like Gaussians: they're mathematically well-behaved and convenient, and we understand them very well.
But even more importantly, they've been used for so long in automatic speech recognition that we can borrow some really useful techniques.
One key technique that we might borrow is model adaptation, which would allow us to make a speech synthesiser that sounds like a particular target speaker, based on a very small amount of their data; that would be a very powerful thing to do.
Our ability to do that comes directly from the use of Gaussians, and from the fact that the methods developed for speech recognition to do speaker adaptation can be pretty much directly borrowed into speech synthesis.
OK, where next? Well, a better regression model, of course, and the one we're going to choose should be no surprise.
It's also very general-purpose: it can learn arbitrary mappings.
It's a neural network.
The inputs and outputs to this network are essentially going to be the same as to the regression tree, so the predictees and predictors are basically the same.
However, we're still going to use a vocoder, and that will limit our quality.
Later, we'll try to fix this problem by generating waveforms by concatenation, or maybe even by generating them directly out of the neural network.
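As a sketch of the shape of that idea, here's a tiny untrained network with the same kind of inputs and outputs; the sizes and random weights are arbitrary, so this is just the structure, not a working synthesiser:

```python
import numpy as np

# Same predictors (binary answers to linguistic context questions) and
# predictees (speech parameters) as the regression tree, mapped through
# a small neural network instead.
rng = np.random.default_rng(0)

n_inputs, n_hidden, n_outputs = 300, 64, 40
W1 = rng.standard_normal((n_inputs, n_hidden)) * 0.01
W2 = rng.standard_normal((n_hidden, n_outputs)) * 0.01

x = rng.integers(0, 2, n_inputs).astype(float)  # one frame's context features
y = np.tanh(x @ W1) @ W2                        # predicted speech parameters
print(y.shape)                                  # (40,)
```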
It's worth noting, though, that what we win by using a neural network over a regression tree, we also lose a little bit of, because there are no Gaussians involved any more.
Think about where those Gaussians are: they're in the acoustic space; they're in the same space as the speech parameters.
In other words, somewhere in our system there are a whole load of Gaussians whose means are values of F0.
So, if we wanted to change the F0 of the model in some simple, systematic way (maybe we'd just like to make the whole system speak at a slightly higher pitch), we know which parameters to go to and change.
We could just multiply them all by some constant greater than one: some very simple transform that would change the model and change its output.
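Here's that transform as a couple of lines of Python, with invented leaf names and mean values:

```python
# Because the Gaussians live in the same space as the speech parameters,
# raising the whole system's pitch is just scaling every F0 mean.
f0_means = {"leaf_1": 120.0, "leaf_2": 135.0, "leaf_3": 110.0}   # Hz

scale = 1.1   # speak about 10% higher
f0_means = {leaf: mean * scale for leaf, mean in f0_means.items()}
print(f0_means)
```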
That's not going to be so easy in a neural network, because the parameters are inside the model, and it's not obvious which parameter does which thing: it's a very distributed representation.
So we do lose something, and we'll come back to that much later.
So, to wrap up: we've got a method for doing speech synthesis, framed as regression.
We've tried a very simple regression model.
Now we're going to use a better regression model, and so we need to learn about neural networks. That's coming up in the next module.
