Speech Synthesis

Following on from the introductory material in Speech Processing, we move on to more sophisticated ways to generate the waveform, from unit selection to statistical parametric models. We also cover some more advanced speech signal processing.

This course is taught at the University of Edinburgh as the Speech Synthesis course, at advanced undergraduate and Masters levels. Students should normally have completed the Speech Processing course first, which includes material on the Text-to-Speech front end. In this Speech Synthesis course, the focus is mostly on waveform generation.

Copies of the videos in this course are gradually becoming available on YouTube, in case you prefer to watch them there (if that’s the case, I’d be interested to hear why…).

  • Weekly schedule

The calendar shows which module(s) you need to complete (videos and essential readings) before each week's lecture. It also lists lab times and the coursework deadline.

  • Readings

You will find reading lists within each module. Here, the same readings are gathered into alphabetically sorted lists, broken down by module or by importance.

  • Module 1 - introduction

    This module contains some introductory material and speech samples, to accompany the first lecture, which is an introduction to the course.

  • Module 2 - unit selection

Concatenating fragments of recorded natural speech waveforms can produce extremely natural synthetic speech. The core problem is how to select the most appropriate waveform fragments to concatenate.
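As a rough illustration (not part of the course materials), the selection problem can be framed as a least-cost path search. Here is a toy Python sketch, with invented candidate units and cost functions, that minimises the sum of target and join costs by dynamic programming:

```python
# Toy sketch of unit selection as a least-cost path search.
# All numbers and cost functions below are invented for illustration.

def select_units(candidates, target_cost, join_cost):
    """Pick one unit per target position, minimising the total of
    target costs (fit to the target specification) and join costs
    (smoothness at each concatenation point), by dynamic programming."""
    # best[i][j] = (cheapest cumulative cost ending at candidate j
    #               of position i, backpointer to previous candidate)
    best = [[(target_cost(0, u), None) for u in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for u in candidates[i]:
            prev = min(range(len(candidates[i - 1])),
                       key=lambda k: best[i - 1][k][0]
                       + join_cost(candidates[i - 1][k], u))
            cost = (best[i - 1][prev][0]
                    + join_cost(candidates[i - 1][prev], u)
                    + target_cost(i, u))
            row.append((cost, prev))
        best.append(row)
    # Trace back the cheapest path from the final position.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = [j]
    for i in range(len(best) - 1, 0, -1):
        j = best[i][j][1]
        path.append(j)
    return [candidates[i][j] for i, j in enumerate(reversed(path))]

# Example: candidates are single hypothetical acoustic values per position.
targets = [1.0, 2.0, 3.0]
candidates = [[0.9, 2.5], [1.8, 2.2], [3.1, 0.5]]
chosen = select_units(candidates,
                      target_cost=lambda i, u: abs(u - targets[i]),
                      join_cost=lambda a, b: 0.5 * abs(a - b))
```

Real systems search over thousands of candidate waveform fragments per position, but the structure of the search is the same.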

  • Module 3 - unit selection target cost functions

The target cost is critical to choosing an appropriate unit sequence. Several different forms are possible, using linguistic features, acoustic properties, or a combination of the two.
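To make the linguistic-feature form concrete, here is a hypothetical sketch of a target cost computed as a weighted sum of feature mismatches between the target specification and a candidate unit. The feature names and weights are invented for illustration:

```python
def linguistic_target_cost(target, candidate, weights):
    """Toy target cost: a weighted sum of linguistic feature mismatches
    between the target specification and a candidate unit's context.
    Feature names and weights are invented for illustration."""
    return sum(w for feature, w in weights.items()
               if target.get(feature) != candidate.get(feature))

# Hypothetical features: matching the phone matters most, then stress,
# then phrase position.
weights = {"phone": 10.0, "stress": 2.0, "phrase_final": 1.0}
target = {"phone": "ae", "stress": 1, "phrase_final": False}
candidate = {"phone": "ae", "stress": 0, "phrase_final": False}
cost = linguistic_target_cost(target, candidate, weights)
```

A candidate from an identical linguistic context would have zero cost; each mismatched feature adds its weight.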

  • Module 4 - the database

The quality of unit selection depends on good-quality recorded speech, with accurate labels.

  • Module 5 - evaluation

    How do we know how good our synthesiser is? Can we use formal evaluation to decide how to improve it?

  • Module 6 - speech signal analysis & modelling

Epoch detection, F0 estimation, and the spectral envelope, and how to represent them for modelling. We also consider aperiodic energy. With these tools, we can analyse and then reconstruct speech: this is called vocoding.
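As a small taster of the signal analysis covered in this module, here is a deliberately naive autocorrelation-based F0 estimation sketch. It is not one of the robust algorithms discussed in the module, just an illustration of the underlying idea: a periodic signal resembles a copy of itself shifted by one pitch period.

```python
import math

def estimate_f0(frame, fs, f0_min=50.0, f0_max=500.0):
    """Naive autocorrelation-based F0 estimate: find the lag (within the
    plausible pitch-period range) at which the frame is most similar to a
    shifted copy of itself, then convert that lag to a frequency."""
    lo = int(fs / f0_max)                        # shortest plausible period
    hi = min(int(fs / f0_min), len(frame) - 1)   # longest plausible period
    best_lag, best_r = lo, float("-inf")
    for lag in range(lo, hi + 1):
        r = sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
        if r > best_r:
            best_r, best_lag = r, lag
    return fs / best_lag

# Example: a pure 200 Hz sine at a 16 kHz sample rate.
fs = 16000
frame = [math.sin(2 * math.pi * 200 * n / fs) for n in range(1024)]
```

Real speech is far less co-operative than a sine wave (octave errors, voicing decisions, noise), which is why the module covers more robust methods.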

  • Module 7 - statistical parametric speech synthesis

    After establishing the key concepts and motivating this way of doing speech synthesis, we cover the Hidden Markov Model approach.

  • Module 8 - speech synthesis using Neural Networks

The use of neural networks is motivated by replacing the regression tree used in the HMM approach with a more powerful regression model.
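To illustrate what "a more powerful regression model" means, here is a minimal one-hidden-layer network trained by gradient descent on a toy nonlinear mapping. Unlike a regression tree, whose predictions are piecewise constant, it learns a smooth function. The layer size, learning rate, and data are all invented for this example:

```python
import math
import random

def train_mlp(data, hidden=8, lr=0.1, epochs=2000, seed=0):
    """Train a one-hidden-layer (tanh) network by per-example gradient
    descent on squared error, and return a prediction function.
    Hyperparameters are invented for this toy example."""
    rng = random.Random(seed)
    w1 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]  # input-to-hidden
    b1 = [0.0] * hidden                                   # hidden biases
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]  # hidden-to-output
    b2 = 0.0                                              # output bias
    for _ in range(epochs):
        for x, y in data:
            h = [math.tanh(w1[j] * x + b1[j]) for j in range(hidden)]
            pred = sum(w2[j] * h[j] for j in range(hidden)) + b2
            err = pred - y
            for j in range(hidden):
                grad_h = err * w2[j] * (1.0 - h[j] ** 2)  # backprop via tanh
                w2[j] -= lr * err * h[j]
                w1[j] -= lr * grad_h * x
                b1[j] -= lr * grad_h
            b2 -= lr * err
    def predict(x):
        h = [math.tanh(w1[j] * x + b1[j]) for j in range(hidden)]
        return sum(w2[j] * h[j] for j in range(hidden)) + b2
    return predict

# Example: fit the smooth nonlinear mapping y = x^2 on [-1, 1].
data = [(x / 10, (x / 10) ** 2) for x in range(-10, 11)]
predict = train_mlp(data)
```

In synthesis, the same idea is applied at much larger scale: the network maps linguistic features to acoustic (vocoder) features, frame by frame.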

  • Module 9 - hybrid speech synthesis

    Using a statistical model to guide unit selection, as part of an Acoustic Space Formulation target cost function, brings together the robustness and control of the model with the high naturalness of concatenated waveforms.
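As a rough sketch of the idea (the precise formulation used in the module may differ), the statistical model can predict a distribution over acoustic features for each target position, and a candidate unit's target cost can then be its negative log-likelihood under that distribution:

```python
import math

def acoustic_target_cost(candidate_acoustics, predicted_mean, predicted_var):
    """Target cost of a candidate unit as its negative log-likelihood under
    a diagonal-covariance Gaussian predicted by a statistical model for this
    target position: candidates close to the prediction are cheap."""
    nll = 0.0
    for x, mu, var in zip(candidate_acoustics, predicted_mean, predicted_var):
        nll += 0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)
    return nll

# Hypothetical model prediction for one target position (two acoustic dims).
mean, var = [1.0, 2.0], [0.5, 0.5]
```

Because the synthetic speech is still built from concatenated natural waveforms, it keeps the naturalness of unit selection while inheriting the model's robustness.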