Module 7 – Statistical Parametric Speech Synthesis

After establishing the key concepts and motivating this way of doing speech synthesis, we cover the Hidden Markov Model approach.


This module is an introduction to the key concept of statistical parametric synthesis: viewing it as a sequence-to-sequence regression problem.

It introduces the first solution to that problem: HMM + regression tree.

Download the slides for these module 7 videos

Total video to watch in this module: 57 minutes

In addition to the core material, there are two bonus videos on hybrid speech synthesis. This uses a model (such as an HMM or DNN) to predict acoustic features for use in an ASF target cost function of a unit selection system. This technique was the state-of-the-art for a considerable time, and can still be heard in some products.

Hybrid speech synthesis is not an examinable topic for 2022-23, but you are still encouraged to understand this technique because it provides another interesting way to think about the interaction between database and waveform generation method.

Download the slides for the bonus videos on hybrid synthesis

Additional video to watch on hybrid synthesis: 29 minutes

By describing text-to-speech as a sequence-to-sequence regression problem, we gain insight into the different ways we might tackle it.

This video just has a plain transcript, not time-aligned to the video.
The goal of this module is to introduce you to statistical parametric speech synthesis. We'll start with what, until fairly recently, was the standard way of doing that, which is to use hidden Markov models. But in fact we'll realise that most of the work is done by a regression tree; the hidden Markov model is not doing so much, it is just the model of the sequence. Before we get there, we'll set up the main concept of framing speech synthesis as a sequence-to-sequence regression task: from a sequence of linguistic specifications to a sequence of either speech parameters, or directly to a waveform.

You already need to know about unit selection speech synthesis at this point, and specifically you need to know how the independent feature formulation (IFF) target cost uses the linguistic specification, by querying each feature individually. In the IFF target cost function, features are queried as either a match or a mismatch between the target specification and the candidate. In other words, linguistic features are treated as discrete things, and we're going to keep doing that. We're actually going to reduce them down to binary features that are either true or false. You also know something about the join cost, whose job is to ensure some continuity between the selected candidates so that they concatenate well, in the case of unit selection. In the previous module we talked about modelling of speech signals: we generalised the source-filter model to think just about spectral envelope and excitation, and we prepared the speech features ready for statistical modelling. For example, we decided we might represent the spectral envelope as the cepstrum.

So let's see where we are in the bigger picture of things. Unit selection is all about choosing candidates.
Let's just forget about the acoustic space formulation way of doing that for now; we'll come back to it much later. Let's just consider the independent feature formulation, which is based only on the linguistic specification, because that's going to be the most similar thing to statistical parametric synthesis, where the input is that linguistic specification. We mentioned a few different ways that you might represent the speech parameters for modelling, but the key points are to be able to separate excitation and spectral envelope, and to have a representation from which you can reconstruct a waveform. So we're not going to think for now about directly generating waveforms from models; we're always going to generate speech parameters, at least in this module.
So we need some statistical model to predict those speech parameters. That's the output, and the input is the linguistic specification, which we're going to represent as a whole bunch of binary features. And that's just a regression task. We've got inputs and outputs; we've got predictors and a predictee. So what we're talking about now is this statistical parametric synthesis method, based on some statistical model, and the model is implementing a regression function. In abstract terms, the input is the linguistic specification, not necessarily represented in this nice, structured way, but somehow flattened. Now, the output could be the speech waveform. But in fact most methods until very recently did not directly generate a speech waveform; instead they generated a parametric form of it, which is why we had the previous module on representing speech ready for modelling.

Let's go straight in, then, to thinking about text-to-speech as a sequence-to-sequence regression task. It's a little bit abstract: we're just setting up the problem, thinking about inputs and outputs. When we've done that, we can have a first attempt at doing this regression, and that's going to be with a fairly conventional method which in the literature is called hidden Markov model speech synthesis. But it would really be a lot better to call it regression tree speech synthesis, and we will see the reason for that when we get to it. It's a regression task, and a regression task needs an input.
So what is that? What are the input features? Well, they're just the linguistic features. We've already said that this structured representation, which might be something stored inside software such as Festival, whilst linguistically meaningful, is not so easy to work with in machine learning. In unit selection, in the independent feature formulation, we essentially queried individual features and looked for a match or mismatch between them. We're going to take that one step further: we're now going to flatten the specification onto the phonetic level. So what does that mean in practice? Consider this one sound here: the schwa in the word "the". We're going to have a model of that, and the model is going to have a name which captures the linguistic context in which that sound is occurring: left phonetic context, right phonetic context, and then whatever other context we think is important. Here I've also put some prosodic features. My notation for the phonetic part is a quinphone (plus and minus two phones), written in HTK style, and the reason for that is that the standard toolkit for doing HMM-based speech synthesis is an extension of HTK. We'll see in a little bit that we'll need to represent this other part in a machine-friendly form as well; I've left it human-friendly for now.
So how do we get this and make it look like a feature vector, ready for the machine learning task of regression? We need to do something to vectorise this representation, and we're actually going to be quite extreme about that input feature vector. It's essentially going to be a vector of binary features, and each binary feature is just going to query one possible property of this linguistic specification. This vector is going to be quite large, much larger than drawn in this picture: it's going to have hundreds of elements, and each element captures something about the linguistic specification.

Let's do an example. Let's imagine that this element of the feature vector captures the feature "is the current phoneme a schwa?", and in this case it is, so that value would be one. These other features might be capturing "is the current phoneme some other value?", and it's not, so there are lots and lots of zeroes around it. What we've got there is an encoding of the current phoneme, and it's called one-hot, also called one-of-N or one-of-K, and we can see that it's very sparse: there's a one capturing the identity of the current phone, and lots of zeros saying that it's not any of the other phones. We'll then use other parts of this feature vector to capture context features in the same one-hot coding, and the remainder of it to capture all of these other things, again reduced to one-hot codings. So this big binary vector is mostly zeros with a few sprinkled values of one, and those ones capture the particular specification here. As it stands, we're going to have one feature vector for each phone in the sequence: a feature vector for this one, and this one, and for each of them. So we have a sequence of feature vectors, one per phone in the sentence that we're trying to say. So the frame rate, or the clock speed if you like, is the phone.
That's the input features dealt with. What about the output features? Well, we already know the answer to that from the last module; all we have to do is stack them up into a vector for convenience. Quite a lot of that vector is going to be taken up representing the spectral envelope: many tens of coefficients, perhaps cepstral coefficients. We'll need one element in which to put F0, and we'll have some representation of the aperiodicity, perhaps as those band aperiodicity parameters: the ratio between periodic and aperiodic energy in each frequency band of the spectrum. At this point I've just naively stacked them all together into one big vector of numbers that captures the entire acoustic specification of one frame of speech. So the frame rate here, the clock speed as I called it before, is going to be the acoustic frame: perhaps 200 frames per second.
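On the output side, a single frame's speech parameters can simply be stacked into one vector; the dimensions below (40 cepstral coefficients, one F0 value, 5 band aperiodicities) are illustrative assumptions, not the values of any particular vocoder.

```python
import numpy as np

# Illustrative dimensions only; real systems differ.
N_CEPSTRAL = 40        # spectral envelope, e.g. cepstral coefficients
N_APERIODICITY = 5     # band aperiodicity parameters

def acoustic_frame(cepstrum, f0, band_aperiodicity):
    """Stack the speech parameters for one 5 ms frame into a single vector."""
    assert len(cepstrum) == N_CEPSTRAL and len(band_aperiodicity) == N_APERIODICITY
    return np.hstack([cepstrum, [f0], band_aperiodicity])   # shape: (46,)

frame = acoustic_frame(np.zeros(N_CEPSTRAL), 120.0, np.zeros(N_APERIODICITY))
print(frame.shape)   # one of these every 5 ms, i.e. about 200 per second
```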
We've reduced the problem, then, to the following. We have some sequence of input vectors; each one of those, say this one here, is going to be for a phone in our input sequence. Maybe that's the schwa. So we've got some clock running here, so we have time, but the units of time here are linguistic units: here, the phone. We've got some output sequence. It also has a time axis, but in different units: the units are now acoustic frames, at a fixed clock speed of perhaps five milliseconds, so 200 per second. The problem now is to map one sequence to the other.

I hope it's immediately obvious that there are two difficult parts to that mapping. One is simply that mapping a linguistic specification to an acoustic frame is a hard regression problem: we've got to predict the acoustics of a little bit of speech based on the linguistic specification. That's hard. The second hard part of the problem is that they're at different clock speeds, and we have to make some relationship between the two. We have a linguistic clock ticking along like this, and we must line that up with our acoustic clock ticking along like this. That's going to involve somehow modelling duration. The duration is going to tell us how long each linguistic unit, each phone, lasts: in other words, how many acoustic frames to generate for it before moving on to the next linguistic unit and generating its acoustic specification, and so on, all the way to the end. So that sets up the problem.
The sequence-to-sequence mapping has at least two hard parts: the regression part, which is making the prediction of the acoustic properties, and the sequence part, with the two clock rates being in quite different domains, one in the linguistic domain and one in the time domain. We're going to need to find some machinery to do those two parts of the problem for us, and our first solution is going to use two different models. We're going to find a model that's good for sequences of things, and a model that's good for regression, and those are two different things. So we're actually going to use two different models and combine them.
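The sequencing half of the problem can be sketched very simply, assuming that some duration model has already told us how many acoustic frames each phone should last: we just up-sample the phone-level input vectors to the frame clock, leaving the hard regression step to a separate model.

```python
def upsample_to_frames(phone_vectors, durations_in_frames):
    """Align the linguistic clock (phones) with the acoustic clock (frames).

    phone_vectors:       one input feature vector per phone
    durations_in_frames: how many acoustic frames each phone should last
    """
    frame_level = []
    for vec, n_frames in zip(phone_vectors, durations_in_frames):
        frame_level.extend([vec] * n_frames)   # repeat the phone's features per frame
    return frame_level

# e.g. three phones lasting 10, 25 and 18 frames -> 53 frame-level input vectors,
# each of which must still be regressed to an acoustic feature vector.
frames = upsample_to_frames([[1, 0], [0, 1], [1, 1]], [10, 25, 18])
print(len(frames))   # 53
```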

Continuing our view of the task as one of regression, we see how that can be solved by combining HMMs (for sequence modelling) with regression trees (to provide the parameters of the HMMs).

This video just has a plain transcript, not time-aligned to the video.
We've got, then, a description of text-to-speech as a sequence-to-sequence regression task. That was a rather abstract description, deliberately, because there are many possible solutions to this problem. What we're going to do now is look at the first solution, the one that was dominant in statistical parametric speech synthesis until quite recently: the so-called hidden Markov model approach, or HMM-based speech synthesis. That's the standard name, so we're going to use it, but it's a little bit misleading, because the work isn't really being done by the hidden Markov model. As we'll see in a moment, it's mostly going to be done by a regression tree. In order to explain why people call it hidden Markov model based speech synthesis, but why we here are going to prefer to call it regression tree plus hidden Markov model speech synthesis, I'm going to give you two complementary explanations. These two explanations are completely consistent with each other. They're both true, and they're both perfectly good ways to understand it, so if one works better for you than the other, go with that one. But it's important that you see two points of view: two different ways of thinking about the problem. What I'm going to do is defer the conventional explanation, in terms of context-dependent hidden Markov models, until second. I'm going to first give the slightly more abstract but more general way of explaining this in terms of regression.
When those two explanations are completed, we'll need to mop up a few details. The practical implementation of context-dependent models immediately raises the question of how we do duration modelling, so we'll touch on that very briefly. Once we've got our model, we'll need to generate speech from it, and I'm going to deal with that extremely briefly too. This is another good point at which to remind you that these videos really are the bare bones. They're just the skeleton that you need to start understanding, and you need to flesh that out with readings. That will be particularly true for generation. I am not going to go into great detail about the generation algorithm. One reason for that is that this generation algorithm, as it stands, is somewhat specific to the hidden Markov model approach to speech synthesis, which is rapidly being superseded by neural network approaches, so I don't want to dwell on it too long. We need to understand what the problem is and that there is a solution, but we're not going to look at the solution in detail; we'll just describe what it does, which is really fairly simple anyway.
So what are these two complementary explanations of hidden Markov model based speech synthesis? The first is to stay with our abstract idea of regression: a sequence-to-sequence regression problem. In the coming slides I'm going to tag that for you in this colour, writing "regression" on those slides, just so you can see that that's that part of the explanation. The other complementary view is to think very practically about how you would actually build a system to do this. How would you build a regression tree plus HMM system to do sequence-to-sequence modelling? That view actually starts from hidden Markov models. It says: we'd like a hidden Markov model of every linguistic unit type (let's say the phone) in every possible linguistic context (let's say quinphone plus prosodic context). Then we immediately realise we're in trouble, because that's a very large set of models, and for almost all of them there's no training data, even in a big data set. We have to fix that problem of not being able to train many of our models because they're unseen in training, and we do that by sharing parameters amongst models. What we're going to see is that this regression and this sharing are the same thing.
We'll start, then, with the regression view. We have, as I said, two tasks to accomplish. We have to do the sequencing problem: we have to take a little walk through the sequence of phones. Each one will have its context attached to it in this sort of flattened structure, so it will look more like this. That walk is at a linguistic timescale: phone to phone to phone. For each of those phones we have to, if you like, expand it out into a physical duration. Each phone will have a different duration to the others, so we need some sort of model of duration to expand that out, and for each phone we then generate a sequence of speech parameters to describe the sound of that phone in that context. So we need a sequencing component in our solution, but sequencing isn't too hard: it's just counting from left to right and deciding how long to spend in each phone. Perhaps more difficult is the problem of predicting the speech parameters once we know which phone we're in and how far through it we are: what's the sound? That's the second part of the problem. Throughout this module we're not going to go all the way to the speech waveform; we'll just do that at the end with a vocoder. We're going to predict a sequence of speech parameters, and we're putting those in vectors; we've seen what they look like, because they're the output feature vectors.
So we need to choose a model of sequences, and we need to choose a model for regression. Let's just choose models that we already know about. For sequencing, the most obvious choice is the hidden Markov model. Why? Well, it's the simplest model that we know of that can generate sequences, so let's choose that. For regression, there are lots and lots of different ways of doing regression, but we can certainly say this is a hard regression problem, because the mapping from the linguistic specification to speech parameters (say, the spectrum) is for sure nonlinear, and might even be non-continuous. So it is a really tough regression problem, and we'd better pick a really general-purpose regression model: something that can learn arbitrary functions, arbitrary mappings from inputs to outputs, from predictors to predictee. The one model that we're pretty comfortable with, because we've used it many times, is the regression tree. So let's pick that. Those chosen models might not be the best, but we know about them and we know how to use them, and that's just as important as them being good models, because it means we know how to, for example, train them from data.
Here's a hidden Markov model. Hidden Markov models are generative models, although their main use in our field is for automatic speech recognition, which is a classification problem: we can create classifiers by having multiple generative models and having them compete to generate the observed data, and whichever can generate it with the highest probability, we assign that class to the data. Here, we're just going to generate from them. So here's a hidden Markov model, and it's generative, so it can generate a sequence of observations. These are the speech parameters: the vocoder parameters. Now, in my abstract picture of a hidden Markov model I've drawn little Gaussians in the states. Of course, the dimensionality of each Gaussian is going to be the same as the dimensionality of the thing we're trying to generate, so they're multivariate Gaussians; I've just drawn univariate ones here to indicate that there are Gaussians in those states. So how much work is this hidden Markov model doing? Well, not a lot, really. It's saying: first do this, and you can do it for a while (you choose), and then do this, and so on. So it's really just saying that things happen in this particular order. That's appropriate for speech, because speech sounds happen in a particular order, whether that's within a phone or within the sentence we're trying to say; we don't want to have things reordered.
The model I've drawn here is a hidden Markov model. It has these self-transitions, and that's the model of duration for the moment. We'll revisit that a bit later, because it's a really rubbish model of duration: it's not going to be good enough for synthesis, but it's good enough to start our understanding, so we'll leave it as it is for now. That's a generative model, and it's a probabilistic generative model, so we could take a random walk through the model. We could toss biased coins to choose whether to take self-transitions or to go to the next state, and from each Gaussian we could somehow generate an observation vector at each time step. We could take what would amount to a random walk through this model, controlled by the model's parameters, and generate some output. We'll come back at the end to say a little bit more about how we do that generation. But before we can get to generating from the model, of course, we'll have to train it on some data.

Now, in this very naive and simplified picture, I've drawn a Gaussian in each state, and this is a model of, say, a particular sound in a particular linguistic context: a model of a very specific sound in a very specific context. It has five emitting states; that's normal in speech synthesis, to get a bit more temporal resolution than we would need in speech recognition. So we have many, many models, and each of the models has five states, and I've said each state needs its own multivariate Gaussian so it can do this generation thing. Just do the multiplication in your head: how many phones are there? How many quinphone contexts can each be in? For each of those, how many prosodic contexts can it be in? Then multiply by five, and you'll see that's a very, very large number of Gaussians: so large that there's no chance we could ever train them all on any finite data set in this naive set-up.
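Here is that back-of-the-envelope multiplication, with invented but plausible numbers (a 45-phone set, full quinphone context, and 1000 distinct prosodic contexts):

```python
# Rough count of Gaussians needed if every context-dependent model owned its own
# parameters. All numbers are illustrative assumptions, not from any real system.
n_phones = 45                           # size of the phone set
quinphone_contexts = n_phones ** 4      # left-left, left, right, right-right neighbours
prosodic_contexts = 1000                # stress, position-in-phrase, etc.
states_per_model = 5

n_gaussians = n_phones * quinphone_contexts * prosodic_contexts * states_per_model
print(f"{n_gaussians:.2e}")             # ~9e+11 Gaussians: far more than any corpus can train
```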
So we need to provide the parameters of the models in some other way. The models can't have their own parameters; they're going to have to be given parameters. So what is the model doing? What is this generation? Well, that's the regression step. It says: you're in the second emitting state of a five-state model for a particular speech sound, and given that (in other words, those are your predictors), please predict, or regress to, an acoustic feature vector. In other words, the parameters of the Gaussian are the output of the regression part of the problem. If you remember this, which I hope you do: this is a classification and regression tree, although remember it's really a classification or regression tree, because it operates in one mode or the other at any one time. We spent a lot more time talking about classification than regression, but the ideas are exactly the same. We're going to use this machinery to provide the parameters of the hidden Markov model, because we know lots of things: we know the phone and its context; in other words, we know the name of the model. We can ask questions about the name of the model, yes/no questions, and given a sequence of questions and their answers, we can descend this tree and arrive at a leaf, in which we find the values of the parameters of that state: namely the mean and variance of the Gaussian. What it amounts to, then, is treating the hidden Markov model simply as a model of a sequence (it says: do things in this order, and spend about this long doing each of them) and using a regression tree to provide the parameters of the states, in other words the means and variances of the Gaussians, which is the regression onto acoustic properties, onto speech parameters.
Let's spell it out in a little more detail, because that's a complex and potentially confusing idea. Here's a linguistic specification. For every sound in every context, we have a hidden Markov model. Let's stick with the schwa in the word "the", in this phonetic context and this prosodic context. In our huge set of hidden Markov models, we have a model specifically for that sound in that context, and there it is. It's got five emitting states, but for all the reasons we just explained, that model does not own its own parameters. Perhaps we never saw this sound in this context in the training data, so we couldn't train a model just for that sound in that context: this model is simply not trainable from the training data. We're going to provide it with its parameters by regression. In other words, there will be some regression tree. It will have a root, it will be binary-branching, each node will ask a question, and at the leaves we will arrive at parameters of the model.
Let's imagine that the first question in the tree is "Is the centre phone a vowel?" Yes, it is; maybe this is the "yes" branch. Then we might ask another question here, such as "Is the phone to the right a stop?" Yes, it is. Maybe this is a very small tree and we've arrived at a leaf, and at the leaf we've got a mean and a variance of a Gaussian. That's the prediction of the parameters of one state of this model, so they go into that state. Now, the real tree is going to be a lot bigger than that. It's going to have to ask about a lot more than just the centre phone and the right phone. At the very least, it would have to also ask about state position, but that's fine: we know that can just be another predictor (is the state position 1, 2, 3, 4, or 5?), and it should probably ask about various other features too. That tree is going to be learned from data in the usual way; we're not going to draw it by hand.
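Here is a minimal sketch of that lookup: a tiny, hand-written tree of yes/no questions whose leaves hold the mean and variance of one state's Gaussian. A learned tree would be far larger; the questions, phone classes and numbers below are all invented for illustration.

```python
# Toy regression tree: each internal node asks a yes/no question about the
# flattened linguistic context; each leaf holds Gaussian parameters (mean, var).
VOWELS = {"aa", "ae", "ax", "iy", "uw"}
STOPS = {"p", "b", "t", "d", "k", "g"}

def tree_lookup(centre, right, state_index):
    """Descend the (hand-written, illustrative) tree to get one state's parameters."""
    if centre in VOWELS:                      # "Is the centre phone a vowel?"
        if right in STOPS:                    # "Is the phone to the right a stop?"
            return {"mean": [5.1, 120.0], "var": [0.4, 30.0]}
        return {"mean": [4.8, 110.0], "var": [0.5, 25.0]}
    if state_index <= 2:                      # "Are we in an early state?"
        return {"mean": [2.0, 0.0], "var": [1.0, 1.0]}
    return {"mean": [2.5, 0.0], "var": [1.2, 1.0]}

# Parameters for state 2 of the model of [ax] with a [k] to its right:
print(tree_lookup("ax", "k", state_index=2))
```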

This is the conventional way to describe HMM-based speech synthesis. It corresponds to the way a system would be built in practice, and is similar to HMM-based automatic speech recognition using triphone models.

This video just has a plain transcript, not time-aligned to the video.
So we've described synthesis as a regression task, with a hidden Markov model to do the easy, counting kind of part (you do the beginning, then the middle, then the end of each phone, say) and a regression tree to do the rather harder problem of predicting speech parameters, given where you are in each phone: how far through the model you are, what the phone is, and what its context is, which together make up the name of the model. That still remains a little bit abstract, and in particular it's not completely clear how to get the regression tree. How would we build such a thing? I think a good way to understand how that happens is to take a rather more conventional and practically-oriented view of the whole problem: to think about how we make these context-dependent models, how we do parameter sharing (something called tying), and then to realise that the tying, which is going to be done with a tree, is the regression tree. Then we'll come full circle, and we'll see that our rather abstract "synthesis as regression" explanation makes a lot of sense, once we've seen this practical way of deploying it to build a real system. So we're done with the regression explanation; we're now going to take the context-dependent modelling explanation, and I've tagged the slides with that through here so you can see what's going on. It might be a good idea, first, to have a little recap of these linguistic features and what's going on there.
We've got this structured thing: it's the utterance data structure in Festival. We're basically flattening that. In other words, we're attaching everything that we need to know about this to the phone. So we write the name of the phone (its phoneme identity), its left and right context phonetically, and its surrounding context prosodically. That might be things like syllable stress, or it might be really basic things like its position: is it at the beginning of a phrase, or at the end of a phrase? This is a recap of how we did that in unit selection, and it's basically the same thing. We have this unit here, this abstract target unit, and it's occurring in some context. The unit-in-context then has all that context written onto the unit itself, so the information is there locally: we don't have to look around it, because it's written into the name of the unit itself. That's what I mean by flattening. So the sequence of units in this case can be rewritten as a sequence of context-dependent units. They're going to get rather complicated-looking names, because the name is going to capture not just what they are, but what context they're occurring in.
Let's see an example of that for your favourite sentence. Here's the first sentence of the ARCTIC corpus: "Author of the danger trail, Philip Steels, etc." This is just the beginning of it; it carries on. Each phone is written here. Again we're using HTK notation, so the centre phone occurs between the minus and plus signs: this phone, then the next, and so on through "author of the", et cetera. What we've now constructed, in machine-friendly form (so it isn't very human-friendly), is the name of the hidden Markov model that we're going to use to generate that sound: that phone in that context. For example, when we come to generate the schwa in "the", we're going to need a model called this, so in our big set of models we'd better make sure that we have a model with that name. All this horrible stuff here is capturing position and suprasegmental context, just encoded in some way, and this is the quinphone. Of course, what we're doing here is really taking an automatic speech recognition view: we'll write down context-dependent models, we'll realise that it's impossible to actually train them all on any real data set, and then we'll find a clever solution to that.
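As a sketch of what such a model name might look like, the snippet below builds a label whose quinphone part uses the HTK-style ^ - + = separators; the trailing fields standing in for the positional and suprasegmental context are a made-up simplification of the much richer label format used by real HTS-style systems.

```python
def full_context_name(ll, l, c, r, rr, syl_stress, pos_in_phrase, n_phrase):
    """Build an HTK-style context-dependent model name for one phone.

    Quinphone part: ll^l-c+r=rr. The trailing fields are a simplified,
    invented encoding of prosodic/positional context, just for illustration.
    """
    return f"{ll}^{l}-{c}+{r}={rr}@stress{syl_stress}_pos{pos_in_phrase}of{n_phrase}"

# A schwa with its two neighbouring phones on each side, plus some positional context:
name = full_context_name("v", "dh", "ax", "d", "ey",
                         syl_stress=0, pos_in_phrase=2, n_phrase=9)
print(name)   # v^dh-ax+d=ey@stress0_pos2of9  -- one such name per phone token
```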
So let's progress now through this complementary explanation, the one that's much closer to practice, to how you would actually build a real system like this. That is to describe everything as a context-dependent modelling problem. If you're doing the readings, you'll see that Taylor calls this context-sensitive modelling, but that's the same thing. In any reasonably sized data set, we cannot possibly have every unit type (that is, every phone-in-context): it's not possible to have even one of each in any big data set, and of course we need a lot more than one to train a good model; we might want 10 or 100. The reason is that context has been defined so richly: quinphone plus all this other stuff actually spans the whole sentence, because we might have position-in-sentence as a feature. That means that almost every single token occurring in the training data set is the only one of its type: it's unique. There might be a few types that have more than one token, but not many. Most types that do occur, occur once, and the vast majority of possible types don't occur at all, because the number of types is huge. So we've created a problem for ourselves. We would like context-dependent modelling, because context affects the sound, and that's very important for synthesis: context-independent modelling would sound terrible for speech synthesis. But those context-dependent models are so great in their number of types that we see hardly any of them, just a small subset of them, in our training data. So we have a pair of highly related problems.
We have some types that occur once, for which we could only train a bad, poorly-estimated model, and there are lots and lots of types that don't occur at all, for which we still want a model. The solution to both of those is actually the same thing, and that is to share parameters amongst groups of models. We're going to find models that we think are of essentially the same-sounding speech sound in context; we're going to say they're close enough, so we'll just use the same model. Let's restate that in a slightly more fine-grained way: instead of sharing parameters amongst groups of similar models, we might do it amongst groups of similar states. So we might not tie one whole model to another whole model; we'll do it state by state. It achieves the same thing, just with better granularity. There's a little bit of terminology flying around there, so let's clarify what we're talking about. A model is a hidden Markov model: let's say it's got five emitting states, and it's of a particular phone in a particular context. A model is made up of states, and in the states are some parameters: a multivariate Gaussian, ready to emit observations and do actual speech generation. When we say parameter sharing amongst groups of similar models, we could equally well say sharing amongst groups of similar states, and models are nothing more than their parameters: a hidden Markov model has the parameters in the states and the parameters on the transitions, and that's it. So whether we talk about parameters or models, we're talking about pretty much the same thing.
The core problem, then, is this: how do you train a model for a type, for a phone-in-context, that you have too few examples of, where "too few" includes none? Well, let's forget the things that have no examples; that seems almost impossible at the moment. Let's just think about the things that have a single example: a single token, one occurrence in the training set. We could train a model on that; it would just be very poorly estimated, and so it wouldn't work very well, whether for speech recognition or for speech synthesis. We need more data, and so we could find a group of models, each of which brings one training token to the table, and have them share their training data. If they share their training data, they'll end up with the same parameters, and so that's the same as saying that they're going to share parameters. So we pool training data across groups of models (I've said "types" here); that increases the amount of data, we get a much better estimate of the model parameters, and we end up with whole groups of models that all actually have the same underlying parameters. That's why we say they're sharing their parameters, or that those models, or their states, are tied.
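Numerically, tying is as simple as this sketch suggests: the tied states pool their training frames, a single mean and variance is estimated from the pooled data, and every state in the group then points at those same shared parameters. (Deciding which states belong in a group is the job of the tree; here the grouping is simply given.)

```python
import numpy as np

def tie_states(frames_per_state):
    """Pool the training frames of a group of tied states and share one Gaussian.

    frames_per_state: list of arrays, one per state, each of shape (n_frames, dim);
    some states may bring only a single frame or token to the pool.
    """
    pooled = np.vstack(frames_per_state)          # all data from the whole group
    shared = {"mean": pooled.mean(axis=0),        # one shared mean ...
              "var": pooled.var(axis=0)}          # ... and (diagonal) variance
    # every state in the group points at the *same* parameter object
    return [shared for _ in frames_per_state]

states = tie_states([np.random.randn(1, 3),       # a state seen only once
                     np.random.randn(7, 3),
                     np.random.randn(2, 3)])
print(states[0] is states[1])   # True: tied states literally share parameters
```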
How do we decide which groups of models should be doing the sharing of data, the consequence of which is the sharing or tying of parameters? Well, let's use some common sense and think about some phonetics. The key insight, the key assumption we're making throughout this, is that there are many different contexts that exert rather similar effects on the current phone. So you don't really need a model for every different context: you can use the same model for a whole bunch of contexts, because they're so close that it's good enough, and that lets you share data. We'll group contexts not by rule or anything like that, but according to the effect they have on the current sound. So imagine some sound occurring in some context: we've got this vowel in "pat". Let's imagine that this sound and this sound are so similar that we could use the same model to generate either case. In other words, it doesn't matter whether the left context is a [p] or a [b]: use the same model. Now, in this case, a good way of capturing the fact that [p] and [b] pattern together is that they have the same place of articulation, so representing the context not simply as phones, but in terms of their phonetic features, sounds like a smart thing to do, and that's quite common. So we'll represent phones not just as a one-hot coding out of the phone set, but as lots of one-hot codings of place, manner, voicing, or whatever phonetic features you would like to use. Now, we could actually try to sit down and write rules about that, to express our knowledge. For that example on the previous slide, it would probably work: we could say that anything with a sound of the same place of articulation to its left can share some parameters. But that's going to run into trouble for lots of reasons. First of all, it's going to be really hard; second of all, machine learning is always going to be better. We're going to learn these sharings from data, but the model we end up with will look very much like the one we might have written by hand: a whole bunch of basically if-then-else rules. And we're going to do it in a way that actually pays attention to the data we have at hand.
For any particular database, we're going to find groupings that actually appear to make a difference to the acoustics, to the sound; if a distinction doesn't make a difference, we'll share the parameters. The other good reason for learning from data is that we can scale the complexity of these rules, the granularity of the sharing, according to how much data we have. It's reasonable to state that the more data we have, the more models we can afford to train. In other words, if models are sharing parameters, they need to share less and less as we get more and more data, and in the limit we would have a separate model for every single sound in every single context. We'll never get there, because that would need almost infinite data. So for any real amount of data there's a certain amount of parameter sharing that's optimal, and we can scale it according to how much data we have. That's a very good property of machine learning: if you have more data, you can have a model with more parameters, and that's normally what you'd want to do. The phonetic knowledge that we would have expressed in those hand-written rules is perfectly reasonable, it's sound; it's just hard to write it down manually as rules. We want to get it from data, but we do still want to base it on some phonetic principles, and the way we do that is with a decision tree. The decision tree combines knowledge and learning from data in a very nice way. The knowledge is in the feature engineering and in writing down the questions about those features. So what do we query? For example, we might query phonetic features like nasality, or we might query the identities of the phonemes themselves. That's where our knowledge goes in, and the data tells us how to arrange these questions about features in a tree to get the best possible performance.
This is just a regression tree: it's the regression tree we were talking about rather abstractly in the previous section. The predictors are the linguistic features. They might have been explicitly stacked into a vector and turned into binary features, or not; they might just be implicit in the names of the models. It amounts to the same thing. These are questions about predictors, these are the answers to those questions, and if we answer the questions in the order the tree specifies, we end up at a leaf. This is a leaf node, and the leaf node is the output of the regression: in this case, continuous parameters. Now, this picture relates to automatic speech recognition, where the terminology is that these states have ended up down at the same leaf and they get tied together. Tying means they pool their training data and train a single set of parameters from it, which means they'll have the same values of those parameters. But if we think about that as regression, the values of those parameters are the output of the model. This is a regression tree where the predictors are effectively taken from the name of the model that we're trying to give parameters to, and the predictee is a set of parameters itself: the mean and variance of the Gaussian. We know quite a lot about CART already, so we're not going to go back through the training algorithm. We do, though, need to think about one thing. When we're building the tree, at each split we consider many alternative questions. We've got a big, long list of questions: is there a vowel to the right? is there a nasal to the right? is there a stop to the right? is there a [b] to the right? is there a [p] to the right? and so on, plus the suprasegmental stuff as well. For every possible split we need to evaluate the goodness of that split, so we need a measure of the best split, and there are ways to do that which we can borrow from automatic speech recognition.
What we would ideally like to do is to temporarily split the data, actually train the models, and see how good they are. So at some node we're considering splitting, we train a model: we have model A here. Then we partition the data according to some question we might split on: the data associated with one group of models goes this way, and the rest of the data goes that way. So there are two smaller sets of data down at these leaves, and we train models on both of them, down here at level B. We see whether the models at level B are better than the model at level A by some amount, and whichever question, whichever split, improves the model the most is the one we choose. We might look at the likelihood of the training data, or maybe of some held-out data. That's actually completely impractical, because training the models is expensive and we'd have to do it for many, many candidate splits, so we never actually do it for real. We just make an approximation, and it turns out there's a very good approximation to the increase in likelihood from A to B. We use that instead, and it can be computed without retraining any models; we just need to store some statistics about the alignment between the models and the data. We're not going to get into any more detail on that; it's standard speech recognition technology.
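To give a flavour of that approximation, here is a simplified, single-Gaussian sketch (not the exact criterion used by HTK): because the log-likelihood of data under the Gaussian fitted to that same data depends only on simple sufficient statistics, each candidate question can be scored without retraining any models.

```python
import numpy as np

def gaussian_loglik(frames):
    """Log-likelihood of frames under the ML diagonal Gaussian fitted to those frames.

    For an ML-fitted Gaussian this depends only on the per-dimension variance and
    the frame count, i.e. on simple sufficient statistics of the aligned data.
    """
    n = len(frames)
    var = frames.var(axis=0) + 1e-8
    return -0.5 * n * np.sum(np.log(2 * np.pi * var) + 1.0)

def split_gain(frames, answers):
    """Approximate goodness of one candidate question: likelihood after minus before."""
    yes, no = frames[answers], frames[~answers]
    return gaussian_loglik(yes) + gaussian_loglik(no) - gaussian_loglik(frames)

data = np.random.randn(200, 3)
question = data[:, 0] > 0.0          # stand-in for a yes/no linguistic question
print(split_gain(data, question))    # choose the question with the largest gain
```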
We'll conclude by stating that we can train this tree. The tree describes something which is clustering models together, but it is the regression tree: it's the same thing. The tree is, of course, trained on the training data, and so it's built knowing only about models that had at least one training example in the database. Those models are the ones that have been clustered together down here. Each of these brought, if you like, one training token to the party, and they've been put together. So here there are three; we train the model on all three training tokens, get a better-trained model, and then those models are tied together, or sharing their parameters. But we can easily create models for things that we never saw in the training data, because all we need to know about a model to create it is its name: in other words, its linguistic features, its context, which are encoded in its name. We just follow the tree down and we get parameters for any model we like: the tree works for all models, those that have training data and those that don't. So it now looks like we have a full set of models, for every possible sound in every possible context. We never bother actually expanding that out and saving it, because it would be huge; we just consult the regression tree every time we need to create the parameters of a model that we want.
Let's tie that back to this rather horrible-looking thing, which is just the name of a model. Let's decode it for a moment. This is a model for the sound of "the" in this left phonetic context and this right phonetic context, and in this suprasegmental context, whatever that means: position in phrase, syllable structure, whatever you like. We can create this model now by simply going to the tree. The tree says: have you got a vowel to the right? Yes, you have: go this way. It says: have you got a nasal to the left? No, we haven't: go this way. It says: have you got such-and-such a sound to the right? No, we don't: go this way. And here are the parameters for this model; in fact, the parameters for one state of this model, because we would also query the state index. That concludes this more practical, context-dependent modelling way of seeing the problem. You could get a lot more experience with it, if you wished, by building a simple speech recognition system that used context-dependent models. You probably wouldn't use models with as much context as this; you might restrict yourself to this amount of context, which would be called a triphone, and that was, some time ago, the state of the art in automatic speech recognition. There too we would use trees to cluster the models, effectively doing regression just as we're doing here.

Brief coverage of the end-to-end process of building and using an HMM-based speech synthesiser, including some mention of speech parameter generation and duration modelling.

This video just has a plain transcript, not time-aligned to the video.
Let's summarise this whole process: this practical process of hidden Markov model plus regression tree based speech synthesis. We start the text-to-speech pipeline with linguistic processing that turns text into linguistic features in the usual way, so you've got the same front end as you always have. We then flatten the structure, so we don't have trees and all of that exciting linguistic structure; we just attach context at the phone level. So we flatten the linguistic structures. Now, for this flattened structure, we would like to create one context-dependent HMM for every unique combination of linguistic features, but we realise that's actually impossible, so we need a solution to that problem. The solution happens while we're training the HMMs. Of course, we need labelled speech data (this is just supervised learning), and for all of those models that don't have any training data, or don't have enough training data, there is a single elegant solution, which we can describe as, if you like, re-parameterising the models using a regression tree. In other words, the models don't have as many free parameters as it first looks. If we took the total number of phones-in-context times five states, we'd get a very large number, but that's not how many parameters the models have: they have far, far fewer than that, because they're re-parameterised by the regression tree. In other words, we keep ending up at the same leaf of the tree over and over again for many different states of different models; they are sharing their parameters.
To synthesise from the hidden Markov models, we take the front end, which gives us this flat, context-dependent phone sequence. Each of those is just the name of a model. We pull that model out of our big bag of models; the models are sort of empty, because the states don't have parameters, so we go to the regression tree and we query it, using the name of the model, to fill in the parameters of the model. We then just concatenate the sequence of context-dependent models for the sentence we're trying to say. We somehow generate speech parameters from that (we don't know how to do that yet; I'm about to say something about it in a moment) and then we use a vocoder to convert the speech parameters to a waveform, and we already know how to do that. As I warned you near the start of this module, we're not going to go into great technical detail about the generation algorithms for regression tree plus HMM synthesisers. There's quite a lot of literature about it, and it looks like a prominent and difficult problem. However, because HMM synthesis is probably going to be completely superseded by neural net synthesis, we don't need a deep understanding of this generation algorithm. It's really just smoothing: just fancy-looking smoothing. So what we'll do is state how we would generate from a hidden Markov model, spot a problem with that, and then I'll give you, just graphically, the solution to that problem.
Well, since hidden Markov models are generative models, generating from them should be straightforward, and yes, of course it is. We could just run one as a generative model: for example, we could take a random walk through it. Let's assume for now that it's a hidden Markov model with transitions between states, and self-transitions that give us a model of duration, of staying in each state. Once we're in a state, we need to generate some output: we need to emit an observation. Well, we should follow the most basic principle there is in machine learning, the maximum likelihood principle, and just do the most likely thing. I mean, what else would you do? You wouldn't do the least likely thing. So we generate the most likely output, and the most likely value that you can generate from a Gaussian distribution is, of course, the mean, because that's where the probability density is highest. So we'll just generate the state mean. If we stay in one hidden Markov model state for more than one frame, we'll generate exactly the same value each frame, because the model is Markov (it doesn't remember what it did at the last frame) and because we're following the maximum likelihood principle, so we do the same thing every time. So for as long as we stay in one state, we'll generate constant output. That doesn't sound quite right, but that's what we're doing so far. In this Markov model, the self-transition is the model of duration. For example, we might write some number on it, and that's the probability of staying in the state; we get a rather crude, exponentially decaying duration distribution from that. That's not good enough for synthesis (it's good enough for recognition), because in generation we need to generate a plausible duration, and this model actually gives the highest probability to the shortest possible duration of one frame. That's not good, so we need a proper duration model.
So our model will look like this. It doesn't have the self-transition; it has some other model of duration that says how long we should stay in this state, in other words how many frames we should emit from this state. That's an explicit duration model, not just the rather crude self-transition. The model is then, technically, no longer a hidden Markov model: it's Markov from state to state, but within a state there's an explicit duration model, so you'll see in the literature that these are called hidden semi-Markov models ("semi" because it's only half Markov). Where is that model of duration coming from? Well, we'll use some external model. Basically, what do we know? We know the same stuff as for predicting the other speech parameters: we know the name of the model and which state we're in. So we have a bunch of predictors (the linguistic context, in other words the name of the model, and where we are in that model) and we have a thing we want to predict. The predictee is just the duration, stated in frames, in those acoustic clock units, and we just need to regress from one to the other. It's just another regression problem, so let's use another regression tree. Straightforward. It's going to predict duration at the state level (states are sub-phone units), so we predict a duration per state, and by adding up those durations we get the duration of the phone itself.
This description of duration modelling was very brief, and that's also deliberate, because so far we're using a regression tree to do it, which is not a great model. We'll talk a little more about duration modelling when we move on to more powerful things than regression trees, later, in the neural network speech synthesis part. For now we'll take the regression tree for duration as it is: it will be okay, but we can beat it later.
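A small sketch of how that fits together, with a trivial stand-in for the duration regression tree (the numbers and the single "question" are invented): predict a duration in frames for each of the five states, and the phone duration is just their sum.

```python
# Stand-in for a duration regression tree: in reality the prediction would come
# from tree questions about the full-context model name; these numbers are invented.
def predict_state_durations(model_name, n_states=5):
    """Return a duration in frames (5 ms each) for every state of one phone model."""
    base = 4 if "+sil" in model_name else 3          # a made-up "question"
    return [base, base + 2, base + 3, base + 2, base]

state_durs = predict_state_durations("v^dh-ax+d=ey")
phone_dur_frames = sum(state_durs)                   # duration of the whole phone
print(state_durs, phone_dur_frames * 5, "ms")        # e.g. [3, 5, 6, 5, 3] -> 110 ms
```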
Now let's think about this problem of generating constant output for as long as we stay in an individual state; in other words, the output is piecewise constant. Let's draw a picture of that and see why it's really bad. Here's a hidden Markov model (back to a hidden Markov model, so I've got self-transitions here; this one hasn't got the explicit duration model, but it doesn't matter, the principle is exactly the same), and I'm going to generate from it. I'm just going to do this for a one-dimensional feature, so these really are just univariate Gaussians now. I'm going to plot, against time measured in frames, some speech parameter. Pick any speech parameter you like; I don't know what your favourite speech parameter is, but I will pick F0. So this axis is going to be F0, let's say in Hertz, and we're going to generate F0 from this model as we move through it from left to right.
What happens when we do that? Well, we'll just generate a constant value for as long as we stay in each state, and the trajectory is just those values joined up. To make clear how that works: the Gaussian in each state is in the same dimension as the speech parameter, so that's the Gaussian over F0, this is the mean of that parameter, and we just generate the mean. We join those up and we get this rather chunky-looking thing, and I've never seen an F0 contour that looks like that. If we play that back, it's going to sound pretty robotic. That can't be right. What we need is to generate something smooth. We need to join up those means, but we don't want to be completely constant within each state: essentially, we want a smooth version of that trajectory, maybe looking like this. Now, in the literature on hidden semi-Markov model based speech synthesis (which is often still written as hidden Markov model based speech synthesis) we will find algorithms to do this in a principled and correct probabilistic way, but we can really just understand it as some simple smoothing; that's good enough for our understanding at this point. The correct algorithm for doing this has a name: it's called Maximum Likelihood Parameter Generation (MLPG), and it pays attention not just to the mean but also to the standard deviation of the Gaussians and, really importantly, to the slope: how fast a speech parameter can change. That's really important for F0, because we can only increase or decrease F0 at a certain rate; there are physical constraints on what our muscles can do, for example. So MLPG pays attention to the statistics not only of the speech parameter but also of its rate of change over time, in other words of its deltas, and of course we can also put in delta-deltas. So it's a smoothing that is essentially learned from data: it smooths in the way that is most natural, the way that the data itself is smooth. I'm not going to cover the algorithm here; it's available in papers if you wish to understand it. It has been shown that rather simple ad-hoc smoothing actually achieves much the same outcome, in terms of naturalness, as the correct MLPG, so it's really okay to think of MLPG as smoothing.
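Here is a deliberately crude sketch of that idea for a single parameter such as F0: generate each state's mean for every frame of that state's duration, giving the chunky piecewise-constant trajectory, then apply a simple moving-average filter as an ad-hoc stand-in for proper MLPG (which would instead use the variances and delta statistics).

```python
import numpy as np

def piecewise_constant(state_means, state_durations):
    """Maximum-likelihood output from each state: its mean, repeated per frame."""
    return np.repeat(state_means, state_durations).astype(float)

def smooth(trajectory, width=9):
    """Ad-hoc moving-average smoothing; a crude stand-in for MLPG, which instead
    uses the variances and delta (rate-of-change) statistics to do this properly."""
    kernel = np.ones(width) / width
    return np.convolve(trajectory, kernel, mode="same")

f0_means = [110.0, 125.0, 140.0, 120.0, 105.0]   # invented per-state F0 means (Hz)
durations = [10, 12, 15, 12, 8]                  # frames per state
chunky = piecewise_constant(f0_means, durations) # robotic, stepwise contour
smoothed = smooth(chunky)                        # something closer to a natural contour
print(chunky[:5], smoothed[25:30])
```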
That's the bare bones, really the bare bones, of HMM-based statistical parametric speech synthesis. It's just a first attempt, and it's been deliberately kept slightly abstract and high-level, for two reasons. One is that the technical details are better gained from the readings. But the main one is that this method of synthesis is slowly becoming obsolete; it's not gone yet, but it is gradually being replaced by neural network based speech synthesis. Let's just understand what we've achieved so far. That was our first attempt at statistical parametric speech synthesis, and we did it by choosing models that we know well. That seems a little naive, deciding only to choose models we've already used in the past, but it's actually perfectly sensible, and the reason is that we've got good algorithms for those models. We understand very, very well how to use hidden Markov models. This idea of clustering them is well established in automatic speech recognition; we've re-described it as regression here, but the ways of doing it are very well understood and very well tested, and there are also good software implementations of them. So we've got algorithms for building the trees, and we've got really good algorithms for training the hidden Markov models themselves: the Baum-Welch algorithm, which still applies when the model is re-parameterised in this way. We're not going to get into how that works; we'll just state that a model that's parameterised using a regression tree is no more difficult to train than one that has its own parameters inside each model state. So we've got rather poor models, but really good algorithms. That's a common situation in machine learning.
However, the regression trees, although we can train them, they're really fast at run time, we can inspect them if we really want to, and they're somewhat human-friendly, are the key weakness of this method. We really must replace them, and we're going to replace them with something a lot more powerful. Nevertheless, there is something really good about this HMM plus regression tree framework, and that's the Gaussians. We like Gaussians: they're mathematically well-behaved and convenient, and we understand them very well. Even more importantly, they've been used for so long in automatic speech recognition that we can borrow some really useful techniques. One key technique we might borrow is model adaptation, which would allow us to make a speech synthesiser that sounds like a particular target speaker based on a very small amount of their data, and that would be a very powerful thing to do. Our ability to do that comes directly from the use of Gaussians, and from the fact that methods developed for speech recognition to do speaker adaptation can be borrowed pretty much directly into speech synthesis.
Okay, where next? Well, a better regression model, of course, and the one we're going to choose should be no surprise.
It's also very general-purpose, it can learn arbitrary mappings, and it's a neural network.
The inputs and outputs to this network are essentially going to be the same as for the regression tree, so the predictors and predictees are basically the same as they were.
We're still going to use a vocoder, and that will limit our quality.
Later, we'll try and fix this problem by generating waveforms by concatenation, or maybe even generating them directly from the neural network.
It's worth noting, though, that although we win by using a neural network instead of a regression tree, we also lose a little bit, because there are no Gaussians involved anymore.
Think about where those Gaussians are: they're in the acoustic space, the same space as the speech parameters.
In other words, somewhere in our system there is a whole load of Gaussians whose means are values of F0.
So if we wanted to change the F0 of the model in some simple, systematic way, maybe we'd just like to make the whole system speak at a slightly higher pitch, we know which parameters we can go to and change.
We could just multiply them all by some constant greater than one: some very simple transform that would change the model and change its output.
That's not going to be so easy in a neural network, because the parameters are inside the model and it's not obvious which parameter does which thing; it's a very distributed representation.
So we do lose something.
We'll come back to that much later.
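As a tiny illustration of that interpretability, here is a sketch, with made-up leaf names and values, of the kind of global pitch transform just described, assuming each regression-tree leaf stores a Gaussian over log-F0.

```python
import numpy as np

# Hypothetical regression-tree leaves: each stores (mean, variance) of log-F0.
leaf_gaussians = {
    "leaf_001": (np.log(120.0), 0.04),   # illustrative values only
    "leaf_002": (np.log(95.0), 0.05),
}

# Raise the whole voice's pitch by 10%: multiplying F0 by 1.1 is adding log(1.1) in log-F0,
# and only the means need to change.
shift = np.log(1.1)
leaf_gaussians = {name: (mean + shift, var)
                  for name, (mean, var) in leaf_gaussians.items()}
```

There is no equally direct knob to turn inside a trained neural network.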
So to wrap up, we've got a method for doing speech synthesis framed as regression.
We've tried a very simple regression model.
Now we're going to use a better regression model, and so we need to learn about neural networks, and that's coming up in the next module.

The simplest type of hybrid synthesis is essentially unit selection with an ASF target cost function, where the acoustic features are created through "partial synthesis".

This video just has a plain transcript, not time-aligned to the video.
In this module, we're going to bring together the two key concepts that we've covered so far.
And those are unit selection, which generates waveforms by concatenating recordings, and statistical parametric speech synthesis, which uses a model trained on data.
And we're going to use that statistical model to drive a unit selection system.
So obviously, you'll need a very good understanding of statistical parametric speech synthesis before proceeding.
You could do that with either hidden Markov models or deep neural networks.
It doesn't matter.
Either of them could be combined with unit selection to make a hybrid system.
You also obviously need to understand how unit selection works, and particularly that you can potentially get very good naturalness from such a system.
Now, statistical parametric speech synthesis might be flexible and robust to labelling errors, but in systems that use a vocoder, naturalness is limited by that, as well as by other things such as the statistical model not being perfect.
Hybrid synthesis has the potential to improve naturalness compared to statistical parametric speech synthesis.
In contrast, unit selection potentially offers excellent naturalness, simply because it's playing back recorded waveforms.
But if the database has errors of any sort, and particularly labelling errors, they will very strongly affect the naturalness.
Another problem with a unit selection system is that it's quite hard work to optimise it for new data, even a new speaker of the same language.
We perhaps need to change weights in the target cost or join cost.
That's hard work, and we can never be completely sure that we've done the best possible job of it.
So by combining statistical parametric synthesis with unit selection, we have the potential to take the best of both worlds: in particular, to take that robustness and the ability to learn automatically from data, and combine them with the naturalness of playing back waveforms.
We can get a system which has the naturalness of unit selection but is not as affected by, for example, labelling errors in the corpus, and which perhaps isn't as much work to optimise for a new voice.
As well as knowing about those two concepts, you need to know about the components behind them.
You particularly need to know something about signal processing.
What we need to know here is how we might parameterise a speech signal, and that we might do that in rather different ways if we're classifying speech signals, for example in automatic speech recognition, compared to if we want to regenerate the speech signal from the parametric form, from the speech parameters.
That's called vocoding, and we might use very different parameters in these two cases.
Back when we talked about unit selection, we spent some time thinking about sparsity.
We considered how that interacts with the type of target cost function we're using: whether it's measuring similarity between candidates and targets in the independent feature formulation way, in other words based only on linguistic features, or in acoustic space, with an ASF-style target cost.
And we made the claim that if we could measure similarity well in an acoustic space, we might suffer from fewer sparsity problems than when we measure in linguistic space.
If you don't remember why that is, go back to the module on unit selection and compare again the independent feature formulation and the acoustic space formulation of the target cost function.
When we talked about statistical parametric speech synthesis, in the module on HMMs and then the module on deep neural networks, we tried to have a unified view of all of that.
If we'd like a very short description of what statistical parametric speech synthesis is: it's a regression problem from a sequence of linguistic features to a sequence of speech parameters.
So it's a sequence-to-sequence regression problem.
So we're now going to take that knowledge of signal processing and how we might represent speech, and the problems of unit selection such as sparsity, combine them with a technique for sequence-to-sequence regression, such as a deep neural network, and build what we're going to call a hybrid speech synthesiser: hybrid simply because it combines unit selection and a statistical model.
A phrase that you will have come across in the readings from Taylor is this idea of partial synthesis.
This idea is going to be important now.
In the statistical model that we use to drive the unit selection system in the hybrid set-up, we don't need to generate a speech waveform from the parametric representation; that will eventually happen through concatenation.
That's why Taylor says 'partial synthesis': we're not going all the way to a speech waveform, we're going to some other representation which is good enough to then select waveforms.
That means we've got choices about what representation we generate.
It does not need to be the same as we would need when driving a vocoder.
For example, our model could just generate MFCCs, and we could use those to match against candidates in the database.
Equally, we don't need to generate at the high frame rate that we would need if we were vocoding, perhaps 200 frames per second; we might predict the acoustic properties far less often, maybe once per segment or once for each half of a diphone, and use that in the target cost function.
So we're a lot more flexible in what we generate from our statistical model when we're doing hybrid synthesis, compared to statistical parametric speech synthesis.
That's the idea of partial synthesis.
Keep in mind throughout that the statistical model may or may not be generating vocoder parameters; it might be generating something a bit simpler.
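As a sketch of that flexibility, suppose, purely for illustration, that the model predicts a single MFCC vector per unit rather than frame-by-frame vocoder parameters; an acoustic-space target cost could then be as simple as this.

```python
import numpy as np

def asf_target_cost(predicted_mfcc, candidate_mfcc_frames):
    """Acoustic-space target cost for one candidate unit.
    predicted_mfcc: one MFCC vector predicted by the statistical model for this unit.
    candidate_mfcc_frames: (num_frames, dim) MFCCs extracted from the candidate waveform.
    The candidate is summarised by its mean frame; a real system would also weight
    and normalise the coefficients."""
    candidate_summary = candidate_mfcc_frames.mean(axis=0)
    return float(np.linalg.norm(predicted_mfcc - candidate_summary))
```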
Another way to describe hybrid speech synthesis is that it's just statistical parametric speech synthesis with a clever vocoder: a vocoder that generates speech in a clever way.
So we could draw a picture of that.
We have our statistical models; they might be models of context-dependent phones.
They will generate a parametric representation of speech: speech parameters.
Those don't have to be vocoder parameters; they could be anything you want.
But instead of using a vocoder to get from those parameters to a waveform, we use something else: a database of recorded speech, and then concatenation of the fragments, the candidates.
We can view that as statistical parametric speech synthesis with a very clever vocoder based on a speech database.
Or we could describe it as fairly traditional unit selection with a target cost function operating in acoustic space.
The acoustic space will be the speech parameters, which are whatever you want them to be; let's say MFCCs.
It's unit selection, so there's a picture of a set of candidates, and those candidates form a lattice.
Our job is to find a path through the lattice that sounds good, and the target cost function is going to be based on the match between this parametric representation and a particular candidate we're considering.
So those are just different ways of describing the same thing.
Use whichever you're most comfortable with.
Think of it as unit selection with a statistical model doing the target cost's job, or think of it as statistical parametric speech synthesis with a rather clever vocoder.
Just to be clear, then: the speech parameters are anything that you want, because we don't actually need to be able to reconstruct the waveform from them, so we could call them a partial synthesis.
It's not a full specification; it's just enough to make comparisons with candidates from the database, and we're going to measure that distance, and that's the target cost.
I quite like the following analogy, so let's see if it works for you.
When we generate images by computer of real objects, for example people, it's quite usual to start from measurements of real objects and then to make a model.
We can then control that model, for example animate it, make it move, and we render that model to make it look photorealistic, if that's what we want.
So here's how that would work for making a face.
First we get some raw measurement data from a human subject: some sort of 3D scanning device would measure lots of points on the surface of somebody's face and give us this raw data.
That's like the speech database we record in the studio: waveforms are the raw data, high-dimensional and hard to work with.
It's not easy, for example, to directly animate this representation.
So we turn that into a parametric model, which loses some detail but gains control.
It has fewer dimensions to control, but those are now meaningful, so we could change the shape of the mouth more easily in this representation, for example.
Those are just parameters.
To generate the final image, maybe for a movie or a computer game, we have to give a surface to this set of parameters.
Think of these parameters as the mel cepstrum; we could just shade that model, like this.
For me, that's a bit like vocoding.
It now looks like a person; it's kind of convincing, but there's something unnatural about it.
It's rather smooth, in this case it doesn't have any colour, and it doesn't have much texture, because that's a very simple rendering from the parametric model.
If we want to make this look better, we need to put some photorealistic images on top of this shaded model.
The way that can be done is essentially to take lots of little image tiles and cover up the shaded model with little photographs: in this case, little photographs of skin.
This method of photorealistic rendering, by tiling little real images on top of the mesh of a parametric model, is very similar to one particular form of hybrid speech synthesis.

A case study based on one of the readings.

This video just has a plain transcript, not time-aligned to the video.
I'm going to describe that one particular form of hybrid speech synthesis now, taking it just as a case study.
In other words, there are many other ways of doing this.
I'm picking this particular way of doing it because I think it's very easy to understand, and I just like the name 'trajectory tiling'.
The reading is available in the reading list.
I strongly recommend this paper to you.
The core idea is very simple.
We're going to generate speech parameters using a statistical model.
In this paper, it's a hidden Markov model, but if we were repeating this work today we would probably just use a deep neural network.
Those speech parameters are going to be, for example, a spectral envelope, fundamental frequency and energy.
They could be vocoder parameters, and in this paper basically they are, or they could be simpler: a lower frame rate or lower spectral resolution would probably still work.
Whatever we do, we're going to generate those parameters from a statistical model.
Then we're just going to go to a unit database and find the sequence of waveform fragments that somehow matches those parameters.
We have to measure that match, and then concatenate, just as if we were doing unit selection.
So this is a very economical, prototypical approach to hybrid speech synthesis, and that's why I like this paper: it takes this fairly straightforward approach and achieves very good results.
In the coming slides, I'm going to be using this diagram from the paper, or parts of it.
This diagram attempts to say everything all in one figure, but we're going to deconstruct it and look at it bit by bit.
Let's start with just a general overview.
We have a statistical model that generates parametric forms of speech.
Here, those are the fundamental frequency, gain (let's just call that energy), and these things called line spectral pairs, which represent the spectral envelope; we'll say a little bit more about exactly what those are in a moment.
The paper calls these the guiding parameter trajectories.
That's a nice name: we're going to select waveform fragments guided by this specification.
In other words, we might not slavishly obey it exactly.
We allow a bit of a mismatch between the selected waveforms and this representation: some distance that we measure there, and that we'll be willing to compromise on in return for good joins.
In other words, we're going to weight the sum of join costs and target costs, just like in unit selection.
So these parameter trajectories are a guide: we might not get a speech signal that has precisely these properties, but we'll be close to them.
We'll then go to the database of speech and pull out lots of speech fragments.
These things here are called waveform tiles, by analogy with those little image tiles that we saw earlier.
A little bit confusingly, this paper calls this a 'sausage'; let's call it a lattice, because a lattice is what it really is.
And we're going to do the usual thing in unit selection: find the lowest-cost path through this network, and concatenate the corresponding sequence of waveform tiles to produce the output speech signal.
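Finding that lowest-cost path is the same dynamic programming search as in standard unit selection. Here is a minimal Viterbi sketch under assumed data structures: target_costs[t][i] is the target cost of candidate i at position t, candidates[t][i] is the corresponding waveform tile, and join_cost is whatever join cost we choose (for example, the cross-correlation one described later).

```python
def viterbi_search(target_costs, join_cost, candidates):
    """Find the lowest-cost path through the candidate lattice and return the
    selected sequence of candidates (waveform tiles)."""
    T = len(candidates)
    best = [list(target_costs[0])]                 # accumulated cost per candidate
    back = [[None] * len(candidates[0])]           # backpointers
    for t in range(1, T):
        best.append([])
        back.append([])
        for i, cand in enumerate(candidates[t]):
            costs = [best[t - 1][j] + join_cost(prev, cand)
                     for j, prev in enumerate(candidates[t - 1])]
            j = min(range(len(costs)), key=costs.__getitem__)
            best[t].append(costs[j] + target_costs[t][i])
            back[t].append(j)
    # Trace back from the cheapest final candidate.
    i = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [candidates[-1][i]]
    for t in range(T - 1, 0, -1):
        i = back[t][i]
        path.append(candidates[t - 1][i])
    return list(reversed(path))
```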
A key component, obviously, is how to measure the distance between a candidate, one of these waveform tiles, and the specification: the acoustic specification coming from the guiding statistical parametric synthesis system, an HMM in this case.
In order to measure that distance, we obviously have to get these two things into the same domain, the same representation.
We can't measure the difference between a waveform in the time domain and a spectral envelope; we have to convert one to be the same representation as the other.
The obvious thing to do is to take this bit of speech, extract these same features from it, and measure the distance between those extracted features and the guiding parameter trajectories.
That seems reasonable.
We'll see in a moment that we can actually do slightly better than that, but for now, let's assume we extract parameters from the speech candidate and we measure the distance, perhaps just Euclidean distance.
Maybe these things would be normalised, between the parameters of the speech and the guiding trajectories, and then we just sum that over all the frames of a unit, and that will be the target cost of that candidate.
We'll come back to the join cost in a moment.
Before we go any further, we'd better just understand what these line spectral pairs, or LSPs, are.
I'm going to give you a very informal idea of what they capture and how they're rather different from the cepstrum.
So here's the FFT spectrum of a frame of voiced speech.
You can tell it's voiced: you can see the harmonics, and it's probably a vowel because it's got some formant structure; that looks very obvious.
Let's just extract the envelope; there are lots of ways we could do that, any way you like.
We've talked about that before.
The line spectral pairs, quite often called line spectral frequencies, are a way of representing the shape of that spectral envelope, and I'm going to say this rather informally.
This isn't precisely true, but it's approximately the case, and it's a very nice way to understand it.
We have a pair of values representing each peak, each formant, so let's guess where they might be on this diagram.
There will be two for this formant, and then two here, two here, and maybe some representing the rest of the shape, like that.
And the key property of these line spectral pairs is that they're more closely spaced when there's a sharper peak in the spectrum.
So think of them as somehow capturing the formant frequency and bandwidth, using a pair of numbers.
Now, they don't map exactly onto formants; this is a rather informal way of describing things, but I think that's a good enough understanding to go forward with this paper.
So the line spectral pairs, or line spectral frequencies, each have a value, and that value is a frequency.
On this little extract of the figure, we can plot them on a diagram that has time going this way and frequency going that way, the same space as a spectrogram would be plotted in.
Each line spectral frequency clearly changes over time: it has a trajectory.
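For the curious: line spectral frequencies are defined from an LPC model of the spectral envelope. The LPC polynomial is split into a symmetric and an antisymmetric polynomial whose roots lie on the unit circle, and the angles of those roots are the LSFs. Here is a sketch of that conversion; the example coefficients are made up, and in practice they would come from LPC analysis of a real frame.

```python
import numpy as np

def lpc_to_lsf(a):
    """Convert LPC coefficients a = [1, a1, ..., ap] into line spectral frequencies,
    in radians between 0 and pi.  The trivial roots at 0 and pi are discarded."""
    a_ext = np.concatenate([np.asarray(a, dtype=float), [0.0]])
    # Sum and difference polynomials: P(z) = A(z) + z^-(p+1) A(1/z), Q(z) = A(z) - z^-(p+1) A(1/z)
    P = a_ext + a_ext[::-1]
    Q = a_ext - a_ext[::-1]
    def angles(poly):
        theta = np.angle(np.roots(poly))
        return theta[(theta > 1e-6) & (theta < np.pi - 1e-6)]
    return np.sort(np.concatenate([angles(P), angles(Q)]))

# Made-up 4th-order LPC polynomial, just to show the call.
print(lpc_to_lsf([1.0, -1.2, 0.8, -0.3, 0.1]))
```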
Now, the method in this paper could probably equally well have used MFCCs; it just happens to use line spectral pairs.
One nice thing about line spectral pairs is that we can actually draw pictures of them like this: they are meaningful, interpretable.
That's not the case with MFCCs.
Okay, we've understood line spectral frequencies at least well enough to know what's going on in this picture.
We've got this acoustic specification, which has come out of our statistical model: it's got energy, F0, and these line spectral frequencies representing the spectral envelope.
And we've got a candidate waveform from the database, and we're going to try and compare them.
The way to do that is to convert this waveform into the same space as the parameters coming out of the statistical model.
So from the waveform we will extract some parameters.
I'm going to do that by hand.
So this has got some F0 value, it's got some energy, and it's got these line spectral frequencies with their trajectories over time.
That's time and frequency on that little bit of the diagram.
So I've now got the representation from the statistical model in the same domain as the representation of this speech waveform.
And now it will be very easy to use any distance measure we like, Euclidean distance for example, between those two.
We do that frame by frame and then just sum up across all the frames: a sum over time.
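Spelled out as code, that target cost is nothing more than the following sketch; the (num_frames, num_features) layout and the normalisation are my assumptions about how the streams would be packed together, not details taken from the paper.

```python
import numpy as np

def trajectory_target_cost(guide, candidate):
    """Distance between the guiding parameter trajectories (from the statistical model)
    and the trajectories for one candidate waveform tile, summed over its frames.
    Both arrays have shape (num_frames, num_features), e.g. [F0, energy, LSFs...],
    and are assumed to be normalised so that no single stream dominates."""
    frame_distances = np.linalg.norm(guide - candidate, axis=1)   # Euclidean, per frame
    return float(frame_distances.sum())                           # sum over time
```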
Now there's a problem in doing that.
The problem is that the parameters that we extract from a waveform will look a little bit different from the ones generated by the HMM.
In particular, they will be noisy, and the ones generated by the HMM will be rather smooth, because of the nature of the statistical model.
Here are two figures illustrating that idea.
On the left, I've got natural speech with LSFs extracted from it; on the right, I've overlaid, on top of the natural speech spectrogram, trajectories generated by the statistical model.
Look how much smoother they are.
So there's a systematic mismatch between natural LSF trajectories and ones generated from our statistical model.
Here's another picture of the same thing, with the trajectories overlaid on top of each other.
In blue, I've got trajectories extracted from natural speech, and in red I've got ones generated from a statistical model; in this case, it happens to be a deep neural network.
Let's zoom in.
And it's really obvious that there's this systematic mismatch.
Now, the most obvious aspect of that mismatch is that the blue things are very noisy and the red things are very smooth, but there's actually a deeper problem with the mismatch.
And that's that the statistical model will make systematic errors in its predictions; because they're systematic, it will always make the same error.
So for the same phone in the same context, it will tend to make the same error over and over again, whereas in the natural speech, for that phone in that context, every speech sample will be different.
So this mismatch is a problem, and we have a clever way of getting around it.
What we'll do, instead of extracting the features from the waveforms (in other words, the candidates), is actually regenerate them for the training data using our chosen statistical model.
That regeneration is essentially synthesising the training data.
That seems slightly odd at first, but when we think about it deeply, we'll realise this is an excellent way to remove mismatch, because the trajectories that we now have for the training data are from the same model that we'll be using at synthesis time.
So, for example, they will have the same smoothness property, but more importantly and more fundamentally, they will contain the same systematic errors.
If you're finding that idea a little bit hard to grasp, cast your mind back to unit selection, where we thought about what sort of labels to put on the database and what sort of things to consider in the target cost.
We thought about whether, when we label the database, we need to use the canonical phone sequence or a very close phonetic transcription that's exact compared to what the speaker said.
And we came to the conclusion that consistency was more important than accuracy, because we wanted no mismatch between the database and what happens at synthesis time.
That's exactly what's happening here.
The labels that we're putting on the database here are speech parameter trajectories.
The labels are acoustic labels, because our target cost function is going to operate in acoustic space.
So it's important that those acoustic labels (F0, line spectral frequencies, energy), the acoustic labels that we're putting on the training data, look very much like the ones that we'll generate at synthesis time.
For example, if we asked our hybrid synthesiser to say a sentence from the training data, we'd like it to retrieve that entire sentence intact, and that's going to be much more likely if the labels we put on the training data have been regenerated in this way and not extracted from the natural speech.
So that's our target cost taken care of: guiding parameter trajectories, and a waveform fragment in the database for which we also have trajectories, not extracted from that fragment itself but regenerated from the model for the entire training data, and we can compute the target cost with any distance function we like between those two.
The other component we need, obviously, is a join cost.
As we take paths through this lattice, what's called here a 'sausage', we're considering concatenating one candidate with another candidate, and we need a join cost.
Now, we could just use the same sort of join cost as in unit selection: take the one Festival uses, a weighted sum of MFCCs, F0 and energy.
That would work; that's fine.
This paper does something a little bit different: it actually combines the join cost with a method for finding a good join point.
Here's a familiar idea, but used in a different way.
Remember when we were estimating the fundamental frequency of speech signals? We used an idea called autocorrelation, or cross-correlation.
We took two signals, which were just copies of the same signal, and we slid one backwards and forwards with respect to the other, looking for similarity: self-similarity.
Correlation is just a fancy word for similarity.
Our purpose then was to find a shift which corresponds to the fundamental period, from which we can get F0.
What's happening here is essentially the same measure, but used in a different way.
We're now doing it between two different waveforms.
This is the candidate to the left; this is the candidate to the right, when we're considering where we might join them.
And the join might just be a simple overlap-and-add, so we're trying to find a place where the waveforms align with each other with the most similarity.
So we'll take one of them and, as this diagram implies, slide it backwards and forwards with respect to the other one.
At each lag, each offset, we'll measure the cross-correlation between them within some window, and that will give us a number.
We will find the lag which maximises that number, which maximises correlation: similarity.
Now it's similarity between two different waveforms.
If we can find that point of maximum similarity, that suggests it's a really good point at which to join the waveforms.
So we'll line them up at that position of maximum similarity, do a simple fade-out of one waveform and fade-in of the next waveform, and overlap-add them.
That's for finding a good place to join these two particular candidates.
But that number that we computed, the correlation at the best possible offset, is also a good measure of how well they join, and so that's used as the join cost in this paper.
So for every candidate and every possible successive candidate, we put the pair through this cross-correlation, sliding one backwards and forwards with respect to the other, finding the point of maximum similarity between them, and making a note of that similarity value, that cross-correlation value.
That's the join cost that's put into the lattice for the search, and when we do eventually find the best path through the lattice, we'll know exactly where to make the cross-fade between the units.
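Here is a sketch of that join-point search and join cost, assuming both fragments are NumPy arrays at the same sample rate; the overlap length and lag range are made-up values, and a real system would search pitch-synchronously.

```python
import numpy as np

def best_join(left, right, overlap=256, max_lag=80):
    """Slide the head of `right` against the tail of `left`, find the offset that
    maximises normalised cross-correlation, and crossfade there.
    Returns (joined_waveform, similarity); (1 - similarity) could serve as a join cost."""
    tail = left[-overlap:].astype(float)
    best_corr, best_start = -np.inf, 0
    for start in range(2 * max_lag + 1):            # how many samples of `right` to skip
        head = right[start:start + overlap].astype(float)
        if len(head) < overlap:
            break
        denom = np.linalg.norm(tail) * np.linalg.norm(head)
        corr = float(tail @ head / denom) if denom > 0.0 else 0.0
        if corr > best_corr:
            best_corr, best_start = corr, start
    head = right[best_start:best_start + overlap].astype(float)
    fade = np.linspace(0.0, 1.0, overlap)           # simple linear crossfade
    joined = np.concatenate([left[:-overlap].astype(float),
                             (1.0 - fade) * tail + fade * head,
                             right[best_start + overlap:].astype(float)])
    return joined, best_corr
```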
Now, it should be apparent, from the fact that we're doing this cross-correlation, which involves trying lots of different offsets or lags, that this is a relatively expensive sort of join cost to compute: probably a lot more expensive than a simple Euclidean distance of MFCCs at a single join point.
But it's doing a bit more than just a join cost: it's finding the best possible join point as well.
Contrast that with what happens in Festival, where the join points are predetermined: each diphone has a left boundary and a right boundary, which are pitch-synchronous, and those join points are the same regardless of what we concatenate that candidate with.
Here, something smarter is happening: the particular join point for this candidate will vary depending on what we're going to join it to, within a range of a few pitch periods.
Now, the paper also describes how the underlying HMM system is trained.
We don't need to go into that: it's a slightly more sophisticated form of training that we don't need to worry about here, because, to be honest, if we were building the system today we would just use a deep neural network instead of HMMs.
So we can just go on and summarise what we now know about this method with the rather wonderful name 'trajectory tiling'.
The core idea is simple.
We pick a statistical model, here an HMM, and we generate speech parameters using that statistical model.
Here, those parameters are effectively what we would have used for a vocoder; in fact, that's probably because they recycled an HMM system they already had from a complete statistical parametric system.
But they could have used different speech parameters; that would have been okay.
And then we basically do pretty straightforward unit selection: we find the sequence of waveform fragments (in the paper they're called tiles; in Festival we call them candidates), and then we concatenate that sequence.
And of course, there are some details to each of those steps.
The spectral envelope is represented in a very particular way, as line spectral frequencies; that's just a nice representation of the spectral envelope, and other choices would be possible.
The paper does a standard thing, which is to regenerate the training data with the trained statistical model, to provide the acoustic specification of the training data: in other words, the acoustic labels on the training data.
And as we said, that's for precisely the same reasons as when we use an independent feature formulation target cost: we prefer consistency in the linguistic labels over faithfulness to what the speaker actually said, and err on the side of canonical pronunciations with just minor deviations.
The final nice thing the paper does is that it has a join cost function that does two things at once: we get two for the price of one.
Not only does it measure mismatch, which is the join cost that goes into the search; as a byproduct, it finds a good concatenation point, and we can remember that.
So when we do choose a sequence of candidates, we know precisely where the best place to overlap-add each of them is.
Well, what comes next? Well, it's up to you now.
At this point I'm going to stop making videos, because it's futile to make videos about the state of the art: it's going to change all the time.
You now need to go to the primary literature, and by that I mean journal papers or conference papers, not textbooks.
So although Taylor is excellent, it's dated 2009, so it's not going to tell you about the state of the art, and you need to research for yourself what the current state of the art is.
I'm not even going to speculate what it might be by the time you're watching this video; that's your job: go and find the primary literature.
Start with recent papers and work your way back to find out what's happening in speech synthesis today, whether it's in neural networks or in some other paradigm.
I've provided a list of the key journals and conferences, good places to start looking: anything published in those venues is worth considering as something to read.
And that's all, folks.


Download the slides for the class on 2023-02-27

Reading

King: A beginners’ guide to statistical parametric speech synthesis

A deliberately gentle, non-technical introduction to the topic. Every item in the small and carefully-chosen bibliography is worth following up.

Qian et al: A Unified Trajectory Tiling Approach to High Quality Speech Rendering

The term "trajectory tiling" means that trajectories from a statistical model (HMMs in this case) are not input to a vocoder, but are "covered over" or "tiled" with waveform fragments.

Pollet & Breen: Synthesis by Generation and Concatenation of Multiform Segments

Another way to combine waveform concatenation and SPSS is to alternate between waveform fragments and vocoder-generated waveforms.

This module has a two-part quiz – some flashcards, then a speed reading exercise.


Speed reading exercise

You have up to 20 minutes to read Watts et al: From HMMs to DNNs: where do the improvements come from?

Download the paper and set a timer to keep yourself honest!

Then answer the following questions – you are allowed to refer back to the paper, but you must locate the answer as quickly as possible.

Remember that you’ll need to do the readings to fully understand this topic. You might find that things will become clearer when we move on to Neural Networks, because you’ll start to see why we’ve been insisting on describing TTS as a regression problem.

Hybrid synthesis (not examinable for 2023-24) represented the commercial state of the art for some years, but has mostly been replaced by the fully neural approaches covered in the remainder of the course.