HMM speech synthesis, described as context-dependent modelling

This is the conventional way to describe HMM-based speech synthesis. It corresponds to the way a system would be built in practice, and is similar to HMM-based automatic speech recognition using triphone models.

This video just has a plain transcript, not time-aligned to the video.
So we've described synthesis as a regression task, with a hidden Markov model to do the easy, counting kind of part.
It does the beginning, then the middle, and then the end of each phone, say. And a regression tree does the rather harder problem of predicting speech parameters, given where you are in each phone: how far through the model you are, what the phone is, what its context is.
That way of describing the model still remains a little bit abstract, and in particular it's not completely clear how to get the regression tree.
How would we build such a thing? I think a good way to understand how that happens is to take a rather more conventional and practically-oriented view of the whole problem, and just to think about how we make these context-dependent models, how we do parameter sharing (something called tying), and then to realise that the tying, which is going to be done with a tree, is the regression tree.
And then we'll come full circle and see that our rather abstract "synthesis as a regression task" explanation makes a lot of sense, after we see this practical way of deploying it to build a real system.
So we're done with the regression explanation.
We're now going to take the context-dependent modelling explanation.
I've tagged the slides with that through here, so you can see what's going on.
It might be a good idea, first, to have just a little recap of these linguistic features and what's going on there.
We've got this structured thing.
It's the utterance data structure in Festival, and we're basically flattening that.
In other words, we're attaching everything that we need to hold on to, everything that we need to know about this, to the phone.
So we write the name of the phone: its phone name, its left and right context phonetically, and its surrounding context prosodically.
That might be things like syllable stress.
It might be really basic things like its position: is it at the beginning of a phrase, at the end of a phrase?
So this is a recap of how we did that in unit selection, but it's basically the same thing.
We have this unit here, this abstract target unit, and it's occurring in some context. That unit-in-context then has all the context written onto the unit itself, so it's there locally.
We don't have to look around it.
It's written onto the name of the unit itself.
That's what I mean when I say "flattening". So the sequence of units in this case can be rewritten as the sequence of context-dependent units.
So they're going to get rather complicated-looking names, because the name is going to capture not just what they are, but what context they're occurring in.
Let's see an example of that for your favourite sentence.
So here's the first sentence of the ARCTIC corpus: "Author of the danger trail, Philip Steels, etc."
It's just the beginning of it; it carries on.
Each phone is written here.
So again we're using HTK notation.
The centre phone occurs between the minus and plus signs.
So it's this "ao" of "author", et cetera. We've now constructed, in machine-friendly form (so this isn't very human-friendly), the name of the hidden Markov model that we're going to use to generate that sound: that phone in that context.
For example, when we come to generate the schwa in "the", we're going to need a model called this.
So in our big set of models, we'd better make sure that we have a model that's called that. All this horrible stuff here is capturing position and suprasegmental stuff, just encoded in some way, and this is the quinphone.
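As a rough illustration of what such a name might look like in code, here is a minimal sketch; the separators, the feature set, and the function below are all assumptions for illustration, not the exact label format used by Festival or HTS-style systems.

```python
# A minimal sketch of "flattening" linguistic context into a model name.
# The field separators and the particular features chosen here are assumptions;
# real full-context labels use a much richer, precisely specified format.

def full_context_name(ll, l, c, r, rr, syll_stress, pos_in_phrase):
    """Build a quinphone-plus-features name for the centre phone c."""
    # Quinphone part: the centre phone sits between the '-' and '+' signs.
    quinphone = f"{ll}^{l}-{c}+{r}={rr}"
    # Suprasegmental / positional stuff, just encoded into the name somehow.
    extras = f"@stress:{syll_stress}@phrasepos:{pos_in_phrase}"
    return quinphone + extras

# e.g. the schwa in "the" in "... of the danger ..."
print(full_context_name("v", "dh", "ax", "d", "ey", 0, "medial"))
# -> v^dh-ax+d=ey@stress:0@phrasepos:medial
```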
Of course, what we're actually doing here is taking very much an automatic speech recognition view: we'll write down context-dependent models, realise that it's impossible to actually train them on any real data set, and then find a clever solution to that.
So let's progress now through this complementary explanation, the one that's much closer to practice, to how you would actually build a real system like this, and that's to describe everything as a context-dependent modelling problem.
If you're doing the readings, you'll see that Taylor calls this "context-sensitive modelling", but that's the same thing.
In any reasonably-sized data set, we cannot possibly have every (what we've called here) unit type, so every phone-in-context: it's not possible to have at least one of each in any big data set.
Of course, we need a lot more than one to train a good model; we might want 10 or 100.
And the reason is that context has been defined so richly.
The quinphone plus all this other stuff actually spans the whole sentence, because we might have position-in-sentence as a feature.
And that means that almost every single token occurring in the training data set is the only one of its type.
It's unique.
There might be a few that have more than one token, but not many.
Most of the types that do occur, occur just once, and the vast majority of types that are possible don't occur at all, because the number of types is huge. So we've created a problem for ourselves.
We would like to have context-dependent modelling, because context affects the sound; that's very important for synthesis.
Context-independent modelling would sound terrible for speech synthesis, but those context-dependent models are so great in their number of types that we see hardly any of them, just a small subset of them, in our training data.
So there's a pair of highly related problems that we've made for ourselves.
We have some things that occur once; we could train a bad model, a poorly-trained model, on those.
And there are lots and lots of things that don't occur at all, and we want a model for those as well.
The solution to both of those is actually the same thing, and that is to share parameters amongst groups of models.
So we're going to find models that we think are basically of the same-sounding speech sounds in context.
We're going to say they're close enough.
We might as well just use the same model.
Let's just restate that in a slightly more fine-grained way: instead of just parameter sharing amongst groups of similar models, we might do it amongst groups of similar states. So we might not tie one whole model to another whole model; we'll just do it state by state.
It achieves the same thing, just with better granularity.
So a little bit of terminology flying around there.
Let's just clarify what we're talking about here.
A model is a hidden Markov model.
Let's say it's got five emitting states, and it's of a particular phone in a particular context. So a model is made up of states, and in the states are some parameters: a multivariate Gaussian, ready to emit observations and do actual speech generation. When we say parameter sharing amongst groups of similar models, we could equally well say sharing amongst groups of similar states, and models are nothing more than their parameters.
A hidden Markov model has the parameters in the states and the parameters on the transitions, and that's it.
So whether we talk about parameters or models, it's pretty much the same thing we're talking about.
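To make that concrete, here is a minimal sketch of the structure being described, assuming five emitting states and diagonal-covariance Gaussians; this is not any particular toolkit's data structure.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GaussianState:
    """One emitting state: a multivariate (here diagonal-covariance) Gaussian."""
    mean: np.ndarray       # shape (dim,)
    variance: np.ndarray   # shape (dim,), the diagonal of the covariance

@dataclass
class ContextDependentHMM:
    """A model for one phone-in-context: its name plus its parameters."""
    name: str              # the full-context name, as above
    states: list           # e.g. 5 GaussianState objects
    # (transition parameters omitted for brevity)

# Tying two models' states just means two names ending up pointing at the
# *same* GaussianState object, i.e. the same underlying parameters.
```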
The core problem, then, is: how do you train a model for a type, for a phone-in-context, that you have too few examples of, where "too few" includes none?
Well, let's forget the things that have no examples.
That seems almost impossible at the moment.
Let's just think about the things that have a single example: a single token, one occurrence in the training set.
We could train a model on that.
It will just be very poorly estimated, and so it wouldn't work very well, whether for speech recognition or for speech synthesis.
We need more data, and so we could find a group of models, each of which brings one training token to the table, and they share their training data.
If they share their training data, they'll end up with the same parameters.
And so that's the same as saying that they're going to share parameters. So we can pool training data between groups of models (I've said "types" here); that will increase the amount of data, you'll get a much better estimate of the model parameters, and you'll end up with whole groups of models that all actually have the same underlying parameters.
That's why we say they're sharing their parameters, or that those models, or their states, are tied.
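Here is a minimal sketch of what pooling and tying amounts to, under the simplifying assumption that each model contributes the frames aligned to one of its states and we estimate a single shared Gaussian from the pooled frames.

```python
import numpy as np

def tie_states(frame_sets):
    """Pool the training frames from a group of states that we have decided
    to tie, and estimate one shared (diagonal) Gaussian from the pool."""
    pooled = np.concatenate(frame_sets, axis=0)   # (total_frames, dim)
    shared_mean = pooled.mean(axis=0)
    shared_var = pooled.var(axis=0)
    return shared_mean, shared_var

# Three models, each bringing one training token (a few frames) to the party:
tokens = [np.random.randn(12, 40), np.random.randn(9, 40), np.random.randn(15, 40)]
mean, var = tie_states(tokens)
# Every tied state now points at this same (mean, var): they share parameters.
```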
How do we decide which groups of models should be doing the sharing of data, the consequence of which is sharing, or tying, of parameters? Well, let's use some common sense.
Let's think about some phonetics. The key insight, the key assumption we're making throughout this, is that there are many different contexts that exert rather similar effects on the current phone.
And so you don't really need a model for every different context.
You can use the same model for a whole bunch of contexts because they're so close.
It's good enough, and that lets you share data. So we'll group contexts not by rule or anything like that, but according to the effect they have on the current sound.
So imagine some sound occurring in some context.
We've got this "a" in "pat".
But let's imagine that this sound and this sound are so similar that we could use the same model to generate either case.
In other words, it doesn't matter whether the left context is "p" or "b": use the same model. Now, in this case, a good way of detecting that "p" and "b" are the same is that they have the same place of articulation. So representing the context not simply just as phones, but as their phonetic features, sounds like a smart thing to do, and that's quite common.
So we'll represent phones not as one one-hot coding out of the phone set, but as lots of one-hot codings out of place, manner, voicing, whatever the phonetic features are that you would like to use. Now, we could actually try and sit down and write rules about that, to express our knowledge.
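A small sketch of that feature-based view of context, with a tiny assumed feature table; a real feature inventory would cover the whole phone set.

```python
# Instead of one one-hot coding over the whole phone set, represent each phone
# by several small codings over phonetic features.
# The tiny feature table below is an assumption, for illustration only.

PHONE_FEATURES = {
    # phone: (place, manner, voicing)
    "p": ("bilabial", "stop", "unvoiced"),
    "b": ("bilabial", "stop", "voiced"),
    "t": ("alveolar", "stop", "unvoiced"),
    "m": ("bilabial", "nasal", "voiced"),
}

def same_place(left_a, left_b):
    """'p' and 'b' share a place of articulation, so the contexts /p_/ and /b_/
    are plausible candidates for using the same model."""
    return PHONE_FEATURES[left_a][0] == PHONE_FEATURES[left_b][0]

print(same_place("p", "b"))  # True: candidates for sharing
print(same_place("p", "t"))  # False
```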
That example on the previous slide, that would probably work.
We could say that anything with a left context that has the same place of articulation could share some parameters. But that's going to run into trouble for lots of reasons.
First of all, it's going to be really hard.
Second of all, machine learning is always going to be better.
We're going to learn these sharings from data, but the model we'll end up with will look very much like the one we might have written by hand.
It would be a whole bunch of basically if-then-else rules, and we're going to do it in a way that actually pays attention to the data that we've got at hand.
So for any particular database, we're going to find groupings that actually appear to make a difference to the acoustics, to the sound.
And if it doesn't make a difference, we'll share the parameters. The other good reason for learning from data is that we can scale the complexity of these rules, the granularity of the sharing, according to how much data we have.
And I think it's reasonable to just state that the more data we have, the more models we can afford to train.
In other words, if models are sharing parameters, they need to share less and less as we have more and more data, and in the limit we would have a separate model for every single sound in every single context.
We'll never get there, because that would need almost infinite data.
So for any real amount of data, there's a certain amount of sharing of parameters that's optimal, and we can scale that according to how much data we have. That's a very good property of machine learning.
In other words, if you have more data, you could have a model with more parameters, and that's normally what you'd want to do in machine learning.
But the phonetic knowledge that we've expressed in these rules is perfectly reasonable.
It's sound.
It's just hard to get it manually down into some rules.
We want to get that out from the data.
We do want to base this on some phonetic principles, and the way that we do that is with a decision tree.
The decision Tree combines knowledge and learning from data in a very nice way.
The knowledge is in the feature engineering and in writing down the questions about those features.
So what do we query? For example, we might query phonetic features like nasality, or we might query the identities of the phone names themselves.
That's where our knowledge has gone in, and the data tells us how to arrange these questions about features in a tree to get the best possible performance.
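In code, the knowledge part might be nothing more than a list of named questions, each one a membership test on some part of the context. The questions and feature names below are illustrative, not a complete question set.

```python
# Each question queries one predictor (one linguistic feature).  The knowledge
# is in choosing the features and writing the questions; the data decides which
# questions the tree actually uses, and in what order.
QUESTIONS = {
    "R_Vowel":     lambda ctx: ctx["right"] in {"aa", "ae", "ah", "ax", "iy", "uw"},
    "R_Nasal":     lambda ctx: ctx["right"] in {"m", "n", "ng"},
    "L_Nasal":     lambda ctx: ctx["left"] in {"m", "n", "ng"},
    "R_is_b":      lambda ctx: ctx["right"] == "b",
    "Stressed":    lambda ctx: ctx["syll_stress"] == 1,
    "PhraseFinal": lambda ctx: ctx["pos_in_phrase"] == "final",
}
```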
This is just a regression tree.
It's the regression tree we were talking about, rather abstractly, in the previous section.
The predictors are the linguistic features; they might have been explicitly stacked into a vector and turned into binary, or not.
They might be just implicit in the names of the models; it amounts to the same thing.
So these are questions about predictors, these are the answers to those questions, and if we answer the questions in the order the tree specifies, we end up at a leaf.
This is a leaf node, and the leaf node is the output of the regression: in the leaves are continuous parameters.
Now, this picture relates to automatic speech recognition, where the terminology is that these states have ended up down at the same leaf and they get tied together. Tying means they pool their training data and train a single set of parameters from that, which means they'll have the same values of the parameters.
But if we think about that as regression, it's the values of those parameters that are the output of this model. This is a regression tree where the predictors are effectively taken from the name of the model that we're trying to give parameters to, and the predictee is a set of parameters itself: the mean and variance of the Gaussian.
So we know quite a lot about CART already, and we're not going to go back again through the training algorithm.
We do, though, need to just think about one thing, and that's that when we're building the tree, at each split we're considering many alternative questions.
We've got a big, long list of questions: is there a vowel to the right? is there a nasal to the right? is there a stop to the right? is there a "b" to the right? is there a "p" to the right? and so on, and the suprasegmental stuff as well.
For every possible split, we need to evaluate the goodness of that split.
So we do need to have a measure of the best split, and there are some ways to do that that we can borrow from automatic speech recognition.
What we would ideally like to do is to temporarily split the data, actually train the models, and see how good they are.
So at some node we're considering splitting, we train a model: we have model A here.
Then we'll partition the data.
According to some question we'd like to split on, the data associated with one group of models will go this way and the rest of the data will go that way.
So we get two smaller sets of data down at these leaves, and we'll train models down here at level B.
We'll see if the models at level B are better than the models at level A by some amount, and whichever question, whichever split, improves the model the best, we'll choose.
So we might look at the likelihood on the training data, or maybe on some held-out data.
That's actually completely impractical, because training the models is expensive and we have to do it for many, many candidate splits, so we never actually do that for real.
We just make an approximation, and it turns out there's a very good approximation to the increase in likelihood from A to B, and we use that instead, and it can be computed without retraining any models.
We just need to store some statistics about the alignment between the models and the data.
Now we're not going to get into any more detail on that.
That's standard speech recognition technology.
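The flavour of that approximation can be sketched as follows, assuming diagonal-covariance Gaussians and that, for each state, we have stored its frame count, sum, and sum of squares from the alignment; this is a simplified illustration of the idea, not any toolkit's exact criterion.

```python
import numpy as np

def pooled_loglik(counts, sums, sumsqs):
    """Approximate log-likelihood of all the frames in a cluster of states,
    under a single shared diagonal Gaussian, computed only from statistics
    stored during alignment (per-state frame counts, sums, sums of squares)."""
    n = sum(counts)
    if n == 0:
        return 0.0
    mean = sum(sums) / n
    var = sum(sumsqs) / n - mean ** 2          # pooled per-dimension variance
    var = np.maximum(var, 1e-6)                # floor, for numerical safety
    # For a Gaussian fitted by maximum likelihood to its own data:
    return -0.5 * n * np.sum(np.log(2 * np.pi) + 1 + np.log(var))

def split_gain(parent_stats, yes_stats, no_stats):
    """Score one candidate question: how much does splitting the parent cluster
    into 'yes' and 'no' clusters increase the approximate likelihood?"""
    return (pooled_loglik(*yes_stats) + pooled_loglik(*no_stats)
            - pooled_loglik(*parent_stats))
```

At each node, every candidate question is scored this way and the one with the largest gain is chosen: no model is ever retrained.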
We'll conclude by stating that we can train this tree.
The tree describes something which is clustering models together, but it is the regression tree.
It's the same thing.
The tree, of course, is then trained on the training data.
And so it's built, knowing only about models that had at least one training example in the database.
And those models are the ones that have got clustered together down here.
So each of these brought, if you like, one training token to the party, and we put them together.
So there we had three; we trained the model on all three training tokens, got a better-trained model, and then those models are tied together, or sharing their parameters.
But we can easily create models for things that we never saw in the training data, because all we need to know about a model to create it is its name.
In other words, its linguistic features, its context, which are in its name.
So we just follow the tree down, and we get parameters for any model we like. The tree works for all models: those that have training data and those that don't.
So it now looks like we have a full set of models, for every possible sound in every possible context.
We never bother actually expanding that out and saving it, because it would be huge.
We just look at the regression tree every time we need to create the parameters for a model that we want.
Let's tie that back to this rather horrible looking thing, which is just the name of a model.
Let's decode it for a moment.
This is a model for this sound, in this left phonetic context and this right phonetic context, and this suprasegmental context, whatever that means: position in phrase, syllables, what have you. And we can create this model now by simply going to the tree.
So the tree says, "Have you got a vowel to the right?" Yes, we have: go this way.
It says, "Have you got a nasal to the left?" No, we haven't: we go this way.
It says, "Have you got this sound to the right?" No, we don't: so we go this way.
And here are the parameters for this model; in fact, the parameters for one state of this model, because we'd also query the state index. That concludes this more practical, context-dependent modelling way of seeing the problem.
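Here is a minimal sketch of the lookup just described, reusing the illustrative QUESTIONS table from earlier; each internal node holds a question and each leaf holds the shared parameters.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TreeNode:
    """Internal nodes hold a question and two children; leaves hold parameters."""
    question: Optional[str] = None        # key into QUESTIONS; None at a leaf
    yes: Optional["TreeNode"] = None
    no: Optional["TreeNode"] = None
    leaf_params: Optional[tuple] = None   # e.g. (mean, variance) at a leaf

def lookup(tree, ctx):
    """Follow the questions down to a leaf and return its shared parameters.
    This works for any context we can name (including contexts never seen in
    training), because all it needs is the model's linguistic features."""
    node = tree
    while node.question is not None:
        node = node.yes if QUESTIONS[node.question](ctx) else node.no
    return node.leaf_params

# e.g. parameters for one state of the model for schwa with "dh" to the left:
# lookup(state2_tree, {"left": "dh", "right": "d",
#                      "syll_stress": 0, "pos_in_phrase": "medial"})
```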
And you could get a lot more experience with that, if you wished, by building a simple speech recognition system that used context-dependent models.
Now, you probably wouldn't use models like this.
You might restrict yourself to this amount of context, and that would be called a triphone, and that, some time ago, was the state of the art in automatic speech recognition.
And we would use trees to cluster those models, effectively doing regression just like we're doing here.
