Build your own unit selection voice

Record your speech and build a unit selection voice for Festival. Create variations of the voice, add domain-specific data, or vary the database size. Evaluate with a listening test.

Credits: this exercise closely follows CSTR’s multisyn voice building recipe.

  • Tools required

    Only needed if you are setting this exercise up on your own. Edinburgh students should skip this step.

  • Introduction

    An overview of the complete process of voice building, and some tips for success.

  • Prepare your workspace

    We're going to be generating quite a lot of different files, so we need a well-organised workspace in which to keep them.

  • Milestones

    To keep on track, check your progress against these milestones. Try to stay ahead of them if you can.

  • The recording script

    Because unit selection relies so heavily on the contents of the database, we need to think carefully about exactly what speech we should record.

  • Make the recordings

    With our carefully chosen script, we now need to go into the recording studio and ask our voice talent to record it. Consistency is the key here, especially when the recording is done over multiple sessions.

  • Prepare the recordings

    Move your recordings into the workspace, convert the waveforms to the right format, and do some sanity checking.

  • Label the speech

    The labels are obtained from the text using the front-end of the text-to-speech system, but we then need to align them to the recorded speech using a technique borrowed from automatic speech recognition.

  • Pitchmark the speech

    The signal processing used for waveform concatenation is pitch-synchronous, so the individual pitch periods in the speech database must be marked.

  • Build the voice

    The final stages of building the voice involve creating the information needed by the target and join costs, plus the representation of the speech needed for waveform generation.

  • Run the voice

    We're done! Time to find out what it sounds like...

  • Improvements and variations

    It would take too long to tune every aspect of the system, but we can still identify some problems and see how to fix them. It's also easy to vary the contents of the database to discover the effect on the synthetic speech.

  • Evaluation

    The main form of evaluation should be a listening test with multiple naive listeners. But there are other ways to evaluate, and potentially to improve, your voice.

  • Writing up

    Because you kept such great notes in your logbook (didn't you?), writing up will be easy and painless.
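
As a rough illustration of the sanity checking mentioned under "Prepare the recordings", the sketch below uses Python's standard wave module to flag recordings whose format differs from an expected one. The function name and the 16 kHz, mono, 16-bit defaults are assumptions made for illustration only, not values taken from the recipe; substitute whatever format your voice building tools require.

```python
import wave
from pathlib import Path


def check_recordings(wav_dir, rate=16000, channels=1, sample_width=2):
    """Return (filename, problem) pairs for WAV files that do not match
    the expected format. The default values are illustrative assumptions."""
    problems = []
    for path in sorted(Path(wav_dir).glob("*.wav")):
        with wave.open(str(path), "rb") as w:
            if w.getframerate() != rate:
                problems.append((path.name, f"sample rate {w.getframerate()} Hz"))
            if w.getnchannels() != channels:
                problems.append((path.name, f"{w.getnchannels()} channels"))
            if w.getsampwidth() != sample_width:
                problems.append((path.name, f"{8 * w.getsampwidth()}-bit samples"))
            if w.getnframes() == 0:
                problems.append((path.name, "empty file"))
    return problems
```

Calling check_recordings on the directory that holds your converted waveforms returns an empty list when every file matches. Anything it does report is worth noting in your logbook before you move on to labelling.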