Getting started

A first look at Festival and how we use it in interactive mode on the command line.

Accessing Festival

These instructions assume you are using the installation of Festival on the computers in the PPLS Appleton Tower (AT) labs. You can work on the computers in the physical lab or use the remote desktop service (see Module 0 for instructions).

Note: The PPLS AT Lab computers/remote desktop we are using for this course are completely separate from Informatics remote desktop/DICE! Importantly, the voice setup we will be using is not available on DICE.

It is possible to install Festival directly onto your computer but this is not necessary for students taking Speech Processing. Installation requires a Unix-like environment (Linux or macOS, or a Linux-style terminal running on Windows) and compiling code from source (see guidance here: Install Festival). If you’ve never compiled code before, and don’t have much experience with the Unix command line, your best bet is to use the PPLS AT Lab computers.

Accessing Festival Remotely

You can use the installation of Festival on the Appleton Tower lab servers using the remote desktop service. To connect using the remote desktop, follow the instructions here: Module 0 – computing requirements

Once you’ve started the remote desktop and logged in (with your UUN and EASE password), you can open the Terminal app by going to the “show apps” button (bottom left on dock) and searching for ‘Terminal’. Right click on the Terminal icon in the dock and pin it to the dock for easy access.

When you are finished, remember to log out: top right of the screen > power button > Log out (don’t power down!)

Assignment Data

If you are using the remote desktop to access the AT lab computers (or are physically in the lab), all the relevant data is already there for you on the Linux machines.

Only if you have installed Festival on your own computer: you will need to get the voice database and dictionaries used to run the voice. You can find instructions by following this link.

Start Festival

Festival has a command line interface which runs in the terminal (i.e. the Unix bash shell). To start it in the PPLS AT lab, you’ll need to:

Make sure the computer is booted into Linux (if it is in Windows, restart the machine and select the penguin (the Linux mascot!) when presented with the choice);
open a terminal by going to the “show apps” button (bottom left on dock) and searching for ‘Terminal’. Right click on the Terminal icon in the dock and pin it to the dock for easy access.

Now open a Terminal and run Festival by typing in festival at the prompt $:

$ festival

You should see some text about the version of Festival we are using (Festival 2.5.0):

Festival Speech Synthesis System 2.5.0:release December 2017
Copyright (C) University of Edinburgh, 1996-2010. All rights reserved.

..etc

and the prompt will also change to show the following:

festival>

This new prompt means that Festival is running; any commands that you type must now be in the Scheme language and will be interpreted by Festival rather than by the bash shell.

You will be pleased to know that Festival’s command-line interface uses the same keyboard shortcuts as the bash shell (e.g., TAB completion, ctrl-a, ctrl-e, ctrl-p, ctrl-n, up/down/left/right cursor keys, etc.). Here’s a nice cheat sheet for common bash commands. For a comprehensive list of these shortcuts, see the Wikipedia entry for GNU Readline.

If you get into trouble at any point and need to exit the current command, use ctrl-c. This applies to both Festival and the bash shell.

It’s really worth learning these keyboard shortcuts because they also apply to the bash shell and will save you a lot of time.

Make Festival speak

Synthesise some sentences to become familiar with the Festival command line.

Festival contains a number of different synthesis engines and for each of these, several voices are available: the quality of synthesis is highly dependent on the particular engine and voice that is being used.

Using the SayText command

By default, Festival will start with a rather old diphone voice, which does not sound great, but is fast and intelligible:

festival> (set! myutt (SayText "Welcome to Festival"))

This command combines several steps: it converts the input text “Welcome to Festival” to a linguistic specification and uses that specification to generate speech by selecting appropriate diphones. The set! myutt part of the command tells Festival to store all the information relating to how the utterance was synthesized in the variable called myutt. You can change the name of this variable to something else when you run the command above. This is handy when you want to look at different examples. You just have to remember which variable refers to which utterance.

Set which voice to use

You can set the voice to the one we will use in the assignment by typing the following after starting festival:

festival> (voice_cstr_edi_awb_arctic_multisyn)

You’ll see some “EST warning” messages printed to the screen, but you can ignore those.

Now try generating a sentence as you did before with the SayText command. Can you hear a difference between the two voices?

Generating speech without playing it

To generate an utterance without playing it, use the following steps instead of SayText:

festival> (set! myutt (Utterance Text "Hello"))
festival> (utt.synth myutt)

Then you can save the utterance myutt to a wave file called “myutt.wav” with the following command:
festival> (utt.save.wave myutt "myutt.wav" 'riff)

This will save a file called myutt.wav in whatever directory you started Festival in. If you just opened a terminal and started Festival without changing directories, you will be in your ‘home’ directory. You can check the folder on the desktop called [your username]’s home and see if a new wav file has appeared there. Otherwise, you can exit Festival by pressing ctrl-d and then type the command pwd at the bash prompt. This will tell you where you currently are: your “present working directory”.
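
If Python is available on the machine, one way to confirm that the file was written correctly is to read its header with the standard-library wave module. This is only a sketch: describe_wav is a hypothetical helper name, not part of Festival, and it assumes the 'riff output is an ordinary uncompressed PCM wav file.

```python
import wave


def describe_wav(path):
    """Return basic header information for an uncompressed PCM wav file."""
    with wave.open(path, "rb") as wav_file:
        n_frames = wav_file.getnframes()
        sample_rate = wav_file.getframerate()
        return {
            "channels": wav_file.getnchannels(),
            "sample_rate_hz": sample_rate,
            "sample_width_bytes": wav_file.getsampwidth(),
            # Duration follows from the number of frames and the sample rate.
            "duration_s": n_frames / sample_rate,
        }
```

Calling describe_wav("myutt.wav") from the directory reported by pwd should return a small dictionary; a duration of zero or an error would suggest the file was not saved as expected.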

Scheme, and lots of brackets

When you issue a command to Festival you must put it in round brackets (...) – if you do not, it will generate an error. You are using a language called Scheme.

Scheme is a LISP-like language used as the interface to Festival. When you run Festival in interactive mode, you talk to Festival in Scheme. Fortunately, we’re not going to have to learn too much Scheme. All you need to know for now is that the basic syntax is (function_name argument1 argument2 ...).

In Scheme, all functions return a value, which by default is printed after the function completes. The SayText function returns an Utterance structure, so Festival just prints a short placeholder of the form #<Utterance 0x...> (the hexadecimal address varies) after the function completes. A variable (myutt in this case) can be set to capture the return value, which will allow us to examine the utterance after processing. This is done using the set! command (note the two sets of brackets):

festival> (set! myutt (SayText "Welcome to Festival"))
#<Utterance 0x...>

The TTS process

We can now examine the contents of the myutt variable. The SayText function is a high level function which calls a number of other functions in a chain. Each of these functions performs a specific job, such as deciding the pronunciation of a word, or how long a phone should be. We’ll be running these step-by-step later on.

The TTS process in Festival is a pipeline of sub-processes, which build up an Utterance structure in stages. This building process takes the original text as input and adds more and more information, which is stored in the utterance structure. In Festival, a unified mechanism for representing all types of data needed by the system has been developed: this is called the Heterogeneous Relation Graph system, or HRG for short.

Each Relation in an HRG is a structure that links items of a particular linguistic type. For example, we have a Word relation which is a list linking all the words, and a Segment relation which links all the phones etc. Relations can take different forms: the most common types are linear lists and trees.

Each module in Festival takes a number of relations as input and either creates new relations as output, or modifies the input ones. The vast majority of modules only write new information, leaving all information in the input untouched (there are a few exceptions, such as post-lexical processing). Because of this, examining the contents of the relations in an utterance after processing gives an insight into the history of the TTS process.

Different configurations of Festival can vary in their use of HRGs and in which modules they call.
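
As a rough illustration of the pipeline idea described above, the following Python sketch models an utterance as a dictionary of named relations, with each module adding a new relation and leaving earlier ones untouched. It is not how Festival is implemented: the function names, the DIGIT_WORDS table and the whitespace tokenisation are all simplifying assumptions made for this example.

```python
def make_utterance(text):
    """Start with only the input text and no relations."""
    return {"text": text, "relations": {}}


def token_module(utt):
    """Add a Token relation: one item per whitespace-separated chunk."""
    utt["relations"]["Token"] = [{"name": t} for t in utt["text"].split()]
    return utt


# Toy expansion table (an assumption); real text normalisation is far richer.
DIGIT_WORDS = {"2": "two", "3": "three"}


def word_module(utt):
    """Add a Word relation built from Token, leaving Token untouched."""
    words = []
    for token in utt["relations"]["Token"]:
        name = token["name"].lower().strip(".,!?")
        words.append({"name": DIGIT_WORDS.get(name, name)})
    utt["relations"]["Word"] = words
    return utt


def synthesise(text, modules):
    """Run each module in order; every stage adds to the same utterance."""
    utt = make_utterance(text)
    for module in modules:
        utt = module(utt)
    return utt


utt = synthesise("Buy 2 tickets.", [token_module, word_module])
print(list(utt["relations"]))  # relation names, in creation order
print([t["name"] for t in utt["relations"]["Token"]])
print([w["name"] for w in utt["relations"]["Word"]])
```

Because the Token relation still holds the raw "2" after the Word relation has been added, inspecting both relations afterwards shows what each stage contributed, which is the same reasoning used when examining a real Festival utterance.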

Examining a saved object

Once you have synthesised an utterance you can do lots of things with it. Here are a few examples.

festival> (utt.play myutt)
festival> (utt.relationnames myutt)
festival> (utt.relation.print myutt 'Word)
festival> (utt.relation.print myutt 'Segment)

You can get a list of the relations that are present in a synthesised utterance by using the utt.relationnames command.

Relations that are lists can easily be printed to the screen with the utt.relation.print command. Try this with all of the relations in an utterance. Some of them won’t reveal useful information, others will.

The output from (utt.relation.print myutt 'Word) may look like this:

()
id _3 ; name hello ; pos_index 16 ; pos_index_score 0 ; pos uh ;
        phr_pos uh ; phrase_score -13.43 ; pbreak_index 1 ;
        pbreak_index_score 0 ; pbreak NB ;
id _4 ; name world ; pos_index 8 ; pos_index_score 0 ; pos nn ;
        phr_pos n ; pbreak_index 0 ; pbreak_index_score 0 ;
        pbreak B ; blevel 3 ;
nil

Each item starts with an id, such as id _3, followed by a series of features separated by semicolons; long items wrap onto indented continuation lines. Each feature has a name and a value, e.g., feature name: pos, feature value: uh.
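
If you copy this printed output into a file or a string, the name/value layout is regular enough to process automatically. The sketch below is one possible way to do that in Python; parse_relation_print is a hypothetical helper written for this example (it is not a Festival function), and it assumes the layout shown in the sample above.

```python
def parse_relation_print(text):
    """Turn utt.relation.print-style text into a list of feature dicts."""
    items = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped in ("", "()", "nil"):
            continue  # skip the first/last lines of the printout and blanks
        if stripped.startswith("id "):
            items.append({})  # a new item begins on each "id ..." line
        if not items:
            continue  # ignore anything before the first item
        for field in stripped.split(";"):
            parts = field.split(None, 1)  # feature name, then its value
            if len(parts) == 2:
                name, value = parts
                items[-1][name] = value.strip()
    return items


SAMPLE = """()
id _3 ; name hello ; pos_index 16 ; pos_index_score 0 ; pos uh ;
        phr_pos uh ; phrase_score -13.43 ; pbreak_index 1 ;
        pbreak_index_score 0 ; pbreak NB ;
id _4 ; name world ; pos_index 8 ; pos_index_score 0 ; pos nn ;
        phr_pos n ; pbreak_index 0 ; pbreak_index_score 0 ;
        pbreak B ; blevel 3 ;
nil"""

for item in parse_relation_print(SAMPLE):
    print(item["name"], item["pos"], item["pbreak"])
```

All values are kept as strings; converting numeric features such as phrase_score to numbers is left to the caller.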

Examining the processing steps

Tokens – First the text is split into Tokens. Look at the Token relation, where an item is created for each component of the text you input. The Token relation will still have digits and abbreviations in it.

Words – The Tokens are then converted to Words: abbreviations and digits are processed and expanded. Look for this in the Word relation.

Part of Speech Tagging – Each word is tagged with its part of speech, which is added as a feature to the Word relation.

Pronunciation – The pronunciation of each word is determined and the Syllable and Segment relations are created. Examine these: the Syllable relation is not very interesting as there is very little information here, just a count of the syllables.

You can look up the pronunciation of a word yourself with the function lex.lookup:

festival> (lex.lookup "caterpillar")
("caterpillar" nil (((k ae t) 1) ((ax p) 0) ((ih l) 1) ((er) 0)))

The actual pronunciation returned depends on which lexicon a particular voice uses, and whether the word is in the lexicon or if Festival has to predict the pronunciation using letter-to-sound rules.
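
The entry returned by lex.lookup is a nested list: the word, a part-of-speech field (nil here), and then one sub-list per syllable holding its phones and a stress value. To make that nesting explicit, here is a small Python sketch that reads the bracketed text into nested lists. The read_sexp helper is written for this example only and handles just the simple bracket, string and symbol syntax seen above, not Scheme in general.

```python
import re


def read_sexp(text):
    """Parse one bracketed expression into nested Python lists of strings."""
    tokens = re.findall(r'\(|\)|"[^"]*"|[^\s()"]+', text)
    pos = 0

    def parse():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            items = []
            while tokens[pos] != ")":
                items.append(parse())
            pos += 1  # consume the closing bracket
            return items
        return token.strip('"')  # symbols and strings both become str

    return parse()


entry = read_sexp(
    '("caterpillar" nil (((k ae t) 1) ((ax p) 0) ((ih l) 1) ((er) 0)))'
)
word, pos_tag, syllables = entry
for phones, stress in syllables:
    print(" ".join(phones), "stress", stress)
```

For this entry the loop prints four lines, one per syllable, so the word has four syllables with stress marked on the first and third.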

Try looking up the pronunciation of some real words, and some made up ones.

Accent Prediction – An intonation module assigns pitch accents (and other intonational events) to syllables. A number of different modules exist within Festival, operating with a number of intonation models including ToBI and Tilt. The assignment voice (awb) doesn’t actually do accent prediction, but you can see what this would look like by trying the older diphone synthesis voice, kal, which does:

To switch to the kal voice, enter the following in festival:

festival> (voice_kal_diphone)

Now, generate a new utterance and look at the associated IntEvent relation to see which pitch events have been assigned. From the pitch events and the predicted durations, a pitch contour is generated. This contour is a list of numbers which specify the pitch at each point for the resulting waveform. There is no easy way to view the pitch contour.

You can use the following command to change back to the assignment voice:

festival> (voice_cstr_edi_awb_arctic_multisyn)

Waveform generation – The Unit relation is created by making a list of diphones from the segments, and the information about the speech needed for synthesis is copied in. The Unit relation contains features with values in square brackets [...]. These are references to the actual speech used to synthesise these units.
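
To see what "making a list of diphones from the segments" means, the following Python sketch pairs each phone with its right-hand neighbour. It is an illustration under stated assumptions rather than Festival's own code: the function name, the "#" silence symbol and the underscore in the diphone names are choices made for this example, and the names used by a particular voice may differ.

```python
def phones_to_diphones(phones, silence="#"):
    """Pad with silence, then pair each phone with the one that follows it."""
    padded = [silence] + list(phones) + [silence]
    return [f"{left}_{right}" for left, right in zip(padded, padded[1:])]


print(phones_to_diphones(["h", "ax", "l", "ow"]))
# ['#_h', 'h_ax', 'ax_l', 'l_ow', 'ow_#']
```

A sequence of n phones padded with silence at both ends gives n + 1 diphones, each spanning the transition from one phone into the next.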

Quit Festival

festival> (quit)

or use ctrl-d, just like in the bash shell. Festival remembers your command history between sessions (again, just like bash). Next time you start Festival you can use the up cursor key to find previous commands, and then hit ‘Enter’ to execute them again. Of course, Festival does not remember the values of variables (e.g., myutt in the above example) between sessions.

Transferring data from the AT lab servers

To get your data (e.g. generated wav files) from the AT lab servers (e.g. remote desktop) you can use a terminal-based command like rsync, an SFTP client like FileZilla or WinSCP, or a file hosting service like OneDrive or Google Drive (every University of Edinburgh student should have OneDrive storage associated with their student account).

For example, the following copies the file myutt.wav that’s in ~/Documents/sp/assignment1 to the directory where you’re running the rsync command from on your own computer:

rsync -avz s1234567@ppls-atl-1071.ppls.ed.ac.uk:Documents/sp/assignment1/myutt.wav ./

where ppls-atl-1071.ppls.ed.ac.uk is the address of one of the PPLS AT lab remote desktops. You can see the list of PPLS AT lab remote desktop addresses here: https://resource.ppls.ed.ac.uk/whoson/atlab.php

Note: The previous command will only work if you’ve already made the directories Documents/sp/assignment1 in your home directory on the AT lab servers. If you haven’t done this, you can skip this for now and try it after you’ve created some files.

You can also use a file transfer app like FileZilla. In this case, you need to set the remote host to scp1.ppls.ed.ac.uk. For FileZilla, go to File > Site Manager, then set the protocol to SFTP, the host to scp1.ppls.ed.ac.uk, and use your UUN as the username and your EASE password as the password. After connecting you should see your home directory on the AT lab servers as the remote site. You can then drag files from the remote site side to the appropriate place on the local site side.

What you should now be able to do

start Festival and make it speak using SayText
capture the Utterance structure returned by SayText
look inside the Utterance structure at the Relations
have an initial understanding of what Relations are, but not yet the full picture
use some of the keyboard shortcuts that are common to Festival and the bash shell
save a synthesized utterance as a .wav file and transfer it to your own computer.