Label the speech

The labels are obtained from the text using the front-end of the text-to-speech system, but we then need to align them to the recorded speech using a technique borrowed from automatic speech recognition.

Before you continue, make sure you have completed the following:

  1. you have finished recording at least the ARCTIC ‘A’ set;
  2. you have a single utts.data file, with one line for each utterance;
  3. you have checked your recorded data, and have a wav folder containing an individual .wav file for every utterance in utts.data;
  4. you have checked that the file naming and numbering are correct.

If you haven’t recorded the full ARCTIC script, then edit utts.data (obviously you should make a backup copy first) so that it only includes prompts for which you have a corresponding wav file.
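One possible way to trim utts.data from the command line is sketched below. It assumes the standard utts.data format, where the utterance name is the second field of each line, and that your recordings are named wav/utterance_name.wav; the first two commands simply let you compare the number of prompts with the number of recordings, and you should inspect utts.data.trimmed before replacing the original file.

bash$ cp utts.data utts.data.backup
bash$ wc -l < utts.data
bash$ ls wav/*.wav | wc -l
bash$ while read -r line; do utt=$(echo "$line" | awk '{print $2}'); [ -f wav/$utt.wav ] && echo "$line"; done < utts.data > utts.data.trimmed
bash$ mv utts.data.trimmed utts.data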

The next stage is to create time-aligned phonetic labels for the speech, using forced alignment and the HTK speech recognition toolkit. First you must set up a directory structure for HTK:

bash$ setup_alignment

This creates a directory called alignment containing various HTK-related files. The script will also tell you that you need to make a couple of files yourself: you will do that in the following steps.
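If you want to see what was set up, you can simply list the new directory; the exact contents will depend on the version of the tools you are using, so treat this only as a quick sanity check that the directory was created:

bash$ ls alignment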

  • Choose the dictionary and phone set

    Various dictionaries are available, depending on your accent. The choice of dictionary also determines which phone set you will use. You might need to add some words to the dictionary, to cover all the words in your additional material (a sketch of adding entries is given after this list).

  • Time-align the labels

    The database needs time-aligned labels. Consistency between these labels and the predictions that the front-end will make at runtime is important, so we will use the same front-end to create the initial label sequence, then use forced alignment to put timestamps on those labels (a quick check of the resulting label files is sketched after this list).
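If you do need to add words to the dictionary, the entry format depends on which dictionary and phone set you chose, so the following is only a sketch: the file name my_additions.dict and the phone symbols are made up, and you should copy the format of existing entries in your chosen dictionary rather than these.

bash$ cat >> my_additions.dict <<EOF
STEELS  s t iy l z
PHILIP  f ih l ih p
EOF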
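After the forced alignment has run, it is worth confirming that every utterance in utts.data ended up with a time-aligned label file, and then looking at one of them by eye. The sketch below assumes the labels are written as lab/utterance_name.lab; adjust the path to wherever your alignment step actually puts them.

bash$ awk '{print $2}' utts.data | while read -r utt; do [ -f lab/$utt.lab ] || echo "no label file for $utt"; done
bash$ less lab/$(awk 'NR==1 {print $2}' utts.data).lab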
