Speech Synthesis

Following on from the introductory material in Speech Processing, we move on to more sophisticated ways to generate the waveform, from unit selection to statistical parametric models. We also cover some more advanced speech signal processing.

This course is taught at the University of Edinburgh as the Speech Synthesis course, at advanced undergraduate and Masters levels. Students should normally have completed the Speech Processing course first, which includes material on the Text-to-Speech front end. In this Speech Synthesis course, the focus is mostly on waveform generation.

  • Weekly schedule

    The calendar shows which module you need to complete before each week's lecture. It also lists lab times and specifies the coursework deadline.

  • Readings

    You will find reading lists within each module. Here, you will find the same readings arranged into alphabetically sorted lists, broken down by module or by importance.

  • Module 1 - introduction

    This module contains introductory material and speech samples to accompany the first lecture, which introduces the course.

  • Module 2 - unit selection

    Concatenating fragments of recorded natural speech waveforms can provide extremely natural synthetic speech. The core problem is how to select the most appropriate waveform fragments to concatenate.

  • Module 3 - unit selection target cost functions

    The target cost is critical to choosing an appropriate unit sequence. Several different forms are possible, using linguistic features, or acoustic properties, or a combination of both.

  • Module 4 - the database

    The quality of unit selection depends on good-quality recorded speech with accurate labels.

  • Module 5 - evaluation

    How do we know how good our synthesiser is? Can we use formal evaluation to decide how to improve it?

  • Module 6 - speech signal analysis & modelling

    Epoch detection, F0 estimation and the spectral envelope. Representing them for modelling. We also consider aperiodic energy. Then, we can analyse and reconstruct speech: this is called vocoding.

  • Module 7 - Statistical Parametric Speech Synthesis

    After establishing the key concepts and motivating this way of doing speech synthesis, we cover the Hidden Markov Model approach.

  • Module 8 - Deep Neural Networks

    The use of neural networks is motivated by replacing the regression tree, which is used in the HMM approach, with a more powerful regression model.

  • Module 9 - sequence-to-sequence models

    True sequence-to-sequence models improve over frame-by-frame models by encoding the entire input sequence, then generating the entire output sequence.

  • The state of the art

    The content of this part of the course is updated each year. We will cover the latest developments.