Text-to-Speech as a regression problem

By describing text-to-speech as a sequence-to-sequence regression problem, we gain insight into the different ways we might tackle it.

The goal of this module is to introduce you to statistical parametric speech synthesis.
We'll start with what, until fairly recently, was the standard way of doing that, which is to use hidden Markov models.
But in fact, we'll realise that it's really just a regression tree doing the work; the hidden Markov model is not doing so much, it's just the model of the sequence.
But before we get there, we'll set up the main concept of framing speech synthesis as a sequence-to-sequence regression task, from a sequence of linguistic specifications to a sequence of either speech parameters or directly to a waveform.
So you already need to know about unit selection speech synthesis at this point, and specifically, you need to know how the independent feature formulation (IFF) target cost uses the linguistic specification by querying each feature individually: in the IFF target cost function, features are queried as either a match or a mismatch between the target specification and the candidate.
In other words, linguistic features are treated as discrete things, and we're going to keep doing that; we're actually going to reduce them down to binary features that are either true or false. You also know something about the join cost, whose job is to ensure some continuity between the selected candidates so that they can be concatenated, in the case of unit selection. And in the previous module, we talked about modelling of speech signals.
We generalised the source-filter model to think just about spectral envelope and excitation, and we prepared the speech features ready for statistical modelling.
For example, we decided we might represent the spectral envelope as the cepstrum. So let's see where we are in the bigger picture of things. Unit selection is all about choosing candidates. Let's just forget about the acoustic space formulation way of doing that for now; we'll come back to it much later.
Let's just consider the independent feature formulation, which is based only on the linguistic specification, because that's going to be the most similar thing to statistical parametric synthesis, where the input is that linguistic specification.
We mentioned a few different ways that you might represent the speech parameters for modelling, but the key points are to be able to separate excitation and spectral envelope, and to have a representation from which you can reconstruct a waveform. So we're not going to think, for now, about directly generating waveforms from models; we're always going to generate speech parameters, at least in this module.
So we need some statistical model, then, to predict the speech parameters. That's the output, and the input is the linguistic specification, which we're going to represent as a whole bunch of binary features.
And that's just a regression task.
We've got inputs and outputs.
We've got predictors and a predictee.
So what we're talking about now is this statistical parametric synthesis method, based on some statistical model, and the model is implementing a regression function. In abstract terms, the input is the linguistic specification, not necessarily represented in this nice, structured way, but somehow flattened. Now, the output could be the speech waveform. But in fact, most methods until very recently did not directly generate a speech waveform, but instead generated a parametric form of it, which is why we had the previous module on representing speech ready for modelling.
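To make that framing concrete, here is a minimal sketch in Python, with made-up dimensionalities and an untrained linear map standing in for whatever regression model we eventually choose; it is an illustration, not any particular system's implementation.

import numpy as np

# Illustrative sizes only (assumptions, not from the lecture):
N_LINGUISTIC = 600   # hundreds of binary linguistic features (the predictors)
N_ACOUSTIC = 70      # stacked speech parameters for one frame (the predictee)

# Any regression model implements a function from predictors to predictee;
# here an untrained random linear map stands in for a real, learned model.
W = np.random.randn(N_LINGUISTIC, N_ACOUSTIC)

def regress(linguistic):
    """Map one linguistic feature vector to one vector of speech parameters."""
    return linguistic @ W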
Let's go straight in, then, to thinking about text-to-speech as a sequence-to-sequence regression task.
So it's a little bit abstract.
We're just setting up the problem, thinking about inputs and outputs.
When we've done that, we can have a first attempt at doing this regression, and that's going to be with a fairly conventional method, which in the literature is called Hidden Markov model speech synthesis.
But it would really be a lot better to call it regression tree speech synthesis, as we'll see when we get to it. The reason for that is that it's a regression task, and a regression task needs an input.
So what is that? What are the input features? Well, they're just the linguistic features.
We've already said that this structured representation, which might be something that's stored inside software such as Festival, whilst linguistically meaningful, is not so easy to work with in machine learning. In unit selection, in the independent feature formulation, we essentially queried individual features, and we looked at match or mismatch between them.
We're going to take that one step further.
We're now going to flatten the specification onto a phonetic level.
So what does that mean in practice? Consider this one sound here: the schwa in this word. We're going to have a model of that, and the model is going to have a name which captures the linguistic context in which that sound is occurring: so, left phonetic context, right phonetic context, and then whatever other context we think is important. Here, I've put some of those other features. My notation for this phonetic part (this is a quinphone: plus and minus two phones) is in HTK style. The reason for that is that the standard toolkit for doing HMM-based speech synthesis is an extension of HTK.
We'll see in a little bit that we'll need to represent this other part in a machine-friendly form as well; I've left it human-friendly for now.
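As a rough sketch of how such a model name might be assembled, here is some Python; the ll^l-c+r=rr label format is a simplified assumption in the spirit of HTK/HTS-style full-context labels (a real label would also append the other context features), and the phone names are hypothetical.

def quinphone_name(phones, i):
    """Build a context-dependent model name for phones[i] in an
    HTK/HTS-like style: ll^l-c+r=rr, i.e. the current phone plus two
    phones of context on each side, padded with 'sil' at the edges."""
    padded = ["sil", "sil"] + list(phones) + ["sil", "sil"]
    ll, l, c, r, rr = padded[i:i + 5]
    return f"{ll}^{l}-{c}+{r}={rr}"

# e.g. the schwa (hypothetical phone names):
print(quinphone_name(["dh", "ax", "k", "ae", "t"], 1))  # sil^dh-ax+k=ae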
So how do we get this and make it look like a feature vector, ready for the machine learning task of regression? We need to do something to vectorise this representation, and we're going to be quite extreme about that. The input feature vector is essentially going to be a vector of binary features, and each binary feature is just going to query one possible property of this linguistic specification.
This vector is going to be quite large, much larger than drawn in this picture: it's going to have hundreds of elements, and each element captures something about the linguistic specification.
Let's do an example.
Let's imagine that this element of the feature vector here captures the feature "is the current phoneme schwa?", and in this case it is, and so that value would be one. These other features might be capturing "is the current phoneme some other value?", and it's not, so there are lots and lots of zeroes around it.
What we've got there is some encoding of the current phoneme, and it's called one-hot, also called one-of-N or one-of-K, and we can see that it's very sparse. There's a one capturing the identity of the current phone, and lots of zeros saying that it's not any of the other phones.
We'll then use other parts of this feature vector to capture the context features in the same one-hot coding, and the remainder of it to capture all of these other things, again reduced to one-hot codings. So this big binary vector, then, is mostly zeros with a few sprinkled values of one, and those ones are capturing the particular specification here.
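Here is a minimal sketch of that one-hot coding, assuming a tiny hypothetical phone inventory and only immediate phonetic context; a real input vector would one-hot-code the full quinphone plus all the other context features.

import numpy as np

PHONES = ["sil", "dh", "ax", "k", "ae", "t"]   # tiny hypothetical inventory

def one_hot(value, inventory):
    """One-of-K coding: a single 1 at the position of value, zeros elsewhere."""
    vec = np.zeros(len(inventory))
    vec[inventory.index(value)] = 1.0
    return vec

def linguistic_vector(left, current, right):
    """Flattened binary input vector for one phone: one-hot codings of
    the current phone and its immediate context, concatenated."""
    return np.concatenate([one_hot(left, PHONES),
                           one_hot(current, PHONES),
                           one_hot(right, PHONES)])

x = linguistic_vector("dh", "ax", "k")   # mostly zeros, with three ones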
As it stands, we're going to have one feature vector for each phone in the sequence: we'll have a feature vector for this, and for this, and for each of them. So we have a sequence of feature vectors, one per phone in the sentence that we're trying to say. So the frame rate, if you like (the clock speed), is the phone.
That's the input features dealt with.
What about the output features? Well, we already know the answer to that from the last module: all we have to do is stack them up into a vector for convenience.
So quite a lot of that vector is going to be taken up representing the spectral envelope: many tens of coefficients, perhaps cepstral coefficients. We'll need to take one element in which to put F0, and we'll have some representation of the aperiodicity, perhaps as those band aperiodicity parameters: the ratio between periodic and aperiodic energy in each frequency band of the spectrum.
Now, at this point, I've just naively stacked them all together in one big vector of numbers that captures the entire acoustic specification of one frame of speech. So the frame rate here (or clock speed, as I called it before) is going to be the acoustic frame rate: perhaps 200 frames per second.
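Here is a sketch of that stacking; the dimensionalities are illustrative assumptions, not fixed choices.

import numpy as np

def acoustic_frame(cepstrum, f0, band_aperiodicities):
    """Stack the speech parameters for one frame into a single vector:
    the spectral envelope (some tens of cepstral coefficients), one F0
    value, and a few band aperiodicity values."""
    return np.concatenate([cepstrum, [f0], band_aperiodicities])

# e.g. 60 cepstral coefficients + F0 + 5 aperiodicity bands = 66 numbers
frame = acoustic_frame(np.zeros(60), 120.0, np.zeros(5))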
We've reduced the problem then to the following.
We have some sequence of input vectors; each one of those (say, this one here) is going to be for a phone in our input sequence. Maybe that's the schwa. And so we've got some clock running here, so we have time, but the units of time here are linguistic units: here, the unit is the phone.
We've got some output sequence. It also has a time axis, but in different units: the units are now acoustic frames, a fixed clock speed of perhaps five milliseconds per frame, so 200 per second. And the problem now is to map one sequence to the other.
I hope it's immediately obvious that there are two difficult parts to that mapping. One is simply that mapping a linguistic specification to an acoustic frame is a hard regression problem: we've got to predict the acoustics of a little bit of speech based on the linguistic specification. That's hard.
And the second hard part of the problem is that they're at different clock speeds; we have to make some relationship between the two. We have a linguistic clock ticking along like this, and we have to line that up with our acoustic clock ticking along like this. So that's going to involve somehow modelling duration. The duration is going to tell us how long each linguistic unit, each phone, lasts: in other words, how many acoustic frames to generate for it before moving on to the next linguistic unit and generating its acoustic specification, and so on, all the way to the end.
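Here is a minimal sketch of that sequence part, assuming the durations (in frames) have already been predicted by some duration model; the numbers are hypothetical.

import numpy as np

def upsample(phone_vectors, durations_in_frames):
    """Line up the two clocks: repeat each phone's linguistic feature
    vector once per acoustic frame of its predicted duration."""
    frames = []
    for vec, dur in zip(phone_vectors, durations_in_frames):
        frames.extend([vec] * dur)
    return np.stack(frames)

# Three phones lasting 10, 25 and 7 frames -> 42 frame-level input vectors.
X = upsample([np.zeros(600), np.ones(600), np.zeros(600)], [10, 25, 7])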
So that sets up the problem: the sequence-to-sequence mapping, and it's got at least two hard parts. There's the regression part, which is making the prediction of the acoustic properties, and there's the sequence part: the two clock rates are in quite different domains, one in the linguistic domain and one in the time domain.
We're going to need to find some machinery to do those two parts of the problem for us, and our first solution is going to use two different models: we're going to find a model that's good for sequences of things, and a model that's good for regression. They're two different things, so we're going to use two different models and combine them.
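Here is a sketch of how those two models combine, with toy stand-ins for both; this is only the shape of the solution, not the HMM system itself.

import numpy as np

def synthesise_parameters(phone_vectors, duration_model, regression_model):
    """The sequence model decides how many frames each phone lasts; the
    regression model maps each frame-level input to speech parameters."""
    outputs = []
    for vec in phone_vectors:
        for _ in range(duration_model(vec)):       # the sequence part
            outputs.append(regression_model(vec))  # the regression part
    return np.stack(outputs)

# Toy stand-ins for the two models:
Y = synthesise_parameters([np.zeros(600), np.ones(600)],
                          duration_model=lambda v: 12,
                          regression_model=lambda v: np.zeros(66))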
