Choose the dictionary and phone set

Various dictionaries are available, depending on your accent. The choice of dictionary also determines which phone set you will use. You might need to add some words to the dictionary, to cover all the words in your additional material.

Decide which accent of English your own speech is closest to: General American English, British English (RP), or Scottish English (Edinburgh). This choice determines which dictionary and phone set you use for the remainder of the assignment. All the instructions assume the British English dictionary unilex-rpx, so if you choose a different dictionary, replace unilex-rpx with one of the other options shown below in every command that includes it.

unilex-gam – General American English
unilex-rpx – British English (RP)
unilex-edi – Scottish English (Edinburgh)

Define the phone set

Copy the files which define the phone set to your alignment directory:

bash$ cp $MBDIR/resources/phone_list.unilex-rpx alignment/phone_list
bash$ cp $MBDIR/resources/phone_substitutions.unilex-rpx alignment/phone_substitutions
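
If you chose a different dictionary, one way to avoid retyping its name is to wrap the two copy commands above in a small shell function. This is only a sketch: copy_phone_set is a hypothetical helper, not part of the build scripts, and it assumes the resource files for the other dictionaries follow the same naming pattern as the unilex-rpx ones.

```shell
# Hypothetical helper (not part of the build scripts): copy the phone set
# files for the chosen dictionary into the alignment directory.
copy_phone_set() {
  dict="$1"
  # Accept only the three dictionary names listed earlier.
  case "$dict" in
    unilex-gam|unilex-rpx|unilex-edi) ;;
    *) echo "unknown dictionary: $dict" >&2; return 1 ;;
  esac
  # Assumes the files are named phone_list.<dict> and phone_substitutions.<dict>.
  cp "$MBDIR/resources/phone_list.$dict" alignment/phone_list &&
  cp "$MBDIR/resources/phone_substitutions.$dict" alignment/phone_substitutions
}
```

For example, copy_phone_set unilex-edi would copy the Edinburgh files, and a misspelled name is rejected before anything is copied.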

The phone_list file contains a list of the phones in your phone set. It also includes some special labels, as is common in automatic speech recognition. If X is a stop or affricate, then a label X_cl is added for its closure portion. The label sp (short pause) is added for inter-word silences, and sil for longer silences (at the start and end of each utterance).
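
To see which of these special labels your phone list contains, something like the following may help. It is a sketch only: list_special_phones is a hypothetical helper, and it assumes the file has one label per line, which you should confirm by opening the file.

```shell
# Hypothetical helper, not part of the build scripts.
# Assumes the phone list has one label per line.
list_special_phones() {
  # Closure labels end in _cl; sp and sil are the pause and silence labels.
  grep -E -e '_cl$' -e '^(sp|sil)$' "$1"
}
```

Run it as list_special_phones alignment/phone_list.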

The phone_substitutions file contains a list of possible substitutions that the aligner is allowed to make. These are restricted to vowel reduction; e.g., the rule ‘aa @’ means that aa can be labelled as @ (schwa), if that is a more likely label, given the trained acoustic model.
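
A quick way to read the rules in plain words is sketched below. describe_substitutions is a hypothetical helper, and it assumes every rule is a line of two whitespace-separated labels, as in the ‘aa @’ example; lines in any other shape are skipped.

```shell
# Hypothetical helper, not part of the build scripts.
# Assumes each rule is a line with two fields: the original phone and its substitute.
describe_substitutions() {
  awk 'NF == 2 { printf "%s may be labelled as %s\n", $1, $2 }' "$1"
}
```

Run it as describe_substitutions alignment/phone_substitutions. For the rule ‘aa @’ it prints "aa may be labelled as @".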

For your first voice build, skip the next optional step and instead create an empty dictionary. Here’s one way to do that:

bash$ touch my_lexicon.scm

Optional: dealing with words that are not in the dictionary

Since forced alignment produces phonetic labels from the speech and its word transcription, it needs to know the pronunciation of each word. In speech synthesis we would use letter-to-sound rules for all unknown words at runtime, but that isn’t accurate enough for labelling the speech data. Remember that any mistakes in the recorded database will have a direct effect on the synthetic speech. Therefore, you need to ensure that every word in your script is in the dictionary.

Checking your script against the dictionary

bash$ festival $MBDIR/scm/build_unitsel.scm
festival> (check_script "utts.data" 'unilex-rpx)

Festival will tell you about any out-of-dictionary words. It will also tell you what pronunciation the letter-to-sound rules predict, which may or may not be correct. If you find any out-of-dictionary words, create a file (in your preferred plain text editor, such as Atom) called my_lexicon.scm which has this format:

(setup_phoneset_and_lexicon 'unilex-rpx)

(lex.add.entry '("pimpleknuckle" nn (((p i m) 1) ((p l!) 0) ((n uh) 1) ((k l!) 0))))
(lex.add.entry '("womble" nn (((w aa m) 1) ((b l!) 0))))

Include the first line, but remember to change the name of the dictionary to whichever one you chose earlier.
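
Since it is easy to forget to change that first line, a small sanity check along these lines may be useful. check_lexicon_header is a hypothetical helper, not part of the build scripts, and it assumes the setup line is the very first line of the file, written exactly as in the example above.

```shell
# Hypothetical helper, not part of the build scripts.
# Succeeds only if the first line of the lexicon file ($1) sets up
# the dictionary named in $2.
check_lexicon_header() {
  head -n 1 "$1" | grep -qF "(setup_phoneset_and_lexicon '$2)"
}
```

For example, check_lexicon_header my_lexicon.scm unilex-edi && echo OK prints OK only when the file begins with the unilex-edi setup line.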

To work out the correct pronunciation for a word, start Festival, run the check_script command as above (which ensures that the correct dictionary is loaded), then use the command lex.lookup to find the pronunciations of similar-sounding words on which to base your pronunciation. If you have a strong non-native accent, don’t try to match the actual sounds you are using, but instead try to write pronunciations that are consistent with the pronunciations of other similar words that you would pronounce in the same way. You are aiming to be consistent across all the entries in the dictionary, rather than faithful to your own fine phonetic detail.