Build your own unit selection voice

Record your speech and build a unit selection voice for Festival. Create variations of the voice, add domain-specific data, or vary the database size. Evaluate with a listening test.

Credits: this exercise closely follows CSTR’s multisyn voice building recipe.

  • Tools required

    Only needed if you are setting this exercise up on your own. My Edinburgh students can skip this step.

  • Introduction

    An overview of the complete process of voice building, and some tips for success.

  • Prepare your workspace

    We're going to be generating quite a lot of different files, so we need a well-organised workspace in which to keep them.

  • Milestones

    To keep on track, check your progress against these milestones. Try to stay ahead of them if you can.

  • The recording script

    Because unit selection relies so heavily on the contents of the database, we need to think carefully about exactly what speech we should record.

    • The utts.data file

      This file is the main index of the unit selection database. Festival uses it to discover which files it can…

    • Adding your own material

      Whilst the ARTIC script gives general diphone coverage, it's not ideal for synthesising all types of sentence. You can try…

    • Automatic text selection

      This is an 'optional extra' and not all students will attempt it, but how about implementing your own greedy text-selection…
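
      As a rough illustration of what greedy selection means here, the sketch below repeatedly picks the sentence that adds the most not-yet-covered diphones. It is a minimal sketch, not the exercise's reference solution: the function names, the input format (a dict from sentence id to a list of phone symbols) and the choice to count plain diphone types are all assumptions made for this example.

```python
def diphones(phones):
    """Return the set of adjacent phone pairs (diphone types) in one sentence."""
    return {(a, b) for a, b in zip(phones, phones[1:])}


def greedy_select(candidates, max_sentences):
    """Greedily choose sentences that add the most new diphone types.

    candidates: dict mapping a sentence id to its list of phone symbols
                (hypothetical input format, assumed for this sketch).
    Returns the chosen ids in selection order and the set of covered diphones.
    """
    units = {sid: diphones(phones) for sid, phones in candidates.items()}
    covered = set()
    chosen = []
    while units and len(chosen) < max_sentences:
        # sorted() makes ties deterministic: the lowest id wins.
        best = max(sorted(units), key=lambda sid: len(units[sid] - covered))
        if not units[best] - covered:
            break  # no remaining sentence adds anything new
        chosen.append(best)
        covered |= units.pop(best)
    return chosen, covered
```

      A real script would probably also weight rare units, or account for context such as stress, but the selection loop would keep the same shape.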

  • Make the recordings

    With our carefully chosen script, we now need to go into the recording studio and ask our voice talent to record it. Consistency is the key here, especially when the recording is done over multiple sessions.

  • Prepare the recordings

    Move your recordings into the workspace, convert the waveforms to the right format, and do some sanity checking.

    • Endpointing

      If you have excessive silences at the start or end of many of your recordings, you might want to endpoint…
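
      One simple way to endpoint is to threshold frame energy and keep a margin of silence on either side of the speech. The sketch below assumes samples already loaded as floats in the range -1 to 1; the function name, frame length, threshold and padding are illustrative assumptions, not values taken from the exercise or from any particular tool.

```python
def endpoint(samples, frame_len=160, threshold=0.01, pad_frames=10):
    """Return (start, end) sample indices that keep the speech plus some silence.

    samples: sequence of floats in [-1, 1] (assumed already read from the file).
    A frame counts as speech when its RMS energy exceeds `threshold`.
    Any partial frame at the very end of the signal is ignored.
    """
    n_frames = len(samples) // frame_len

    def rms(i):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        return (sum(x * x for x in frame) / frame_len) ** 0.5

    loud = [i for i in range(n_frames) if rms(i) > threshold]
    if not loud:
        return 0, len(samples)  # nothing above threshold: leave the file alone

    first = max(loud[0] - pad_frames, 0)
    last = min(loud[-1] + pad_frames + 1, n_frames)
    return first * frame_len, last * frame_len
```

      Whatever method is used, it is worth listening to a few trimmed files to check that quiet sounds at the start or end of an utterance have not been cut off.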

  • Label the speech

    The labels are obtained from the text using the front-end of the text-to-speech system, but we then need to align them to the recorded speech using a technique borrowed from automatic speech recognition.

    • Choose the dictionary and phone set

      Various dictionaries are available, depending on your accent. The choice of dictionary also determines which phone set you will use.…

    • Time-align the labels

      The database needs time-aligned labels. Consistency between these labels and the predictions that the front-end will make at runtime is…

  • Pitchmark the speech

    The signal processing used for waveform concatenation is pitch-synchronous, so the speech database needs to have the individual pitch periods marked.

  • Build the voice

    The final stages of building the voice involve creating the information needed by the target and join costs, plus the representation of the speech needed for waveform generation.

    • Utterance structures

      The target cost in Festival is computed using linguistic information, so we need to provide that information for all the…

    • Pitch tracking

      One component of the join cost is the fundamental frequency, F0. This is extracted separately from the pitch marks, although…

    • Join cost coefficients

      The join cost measures potentially-audible mismatch at the points where candidate units from the database are joined. To make the…

    • Waveform representation

      Although unit selection is essentially the concatenation of pre-recorded waveform fragments, we may store those waveforms in terms of source-filter…

  • Run the voice

    We're done! Time to find out what it sounds like...

  • Improvements and variations

    It would take too long to tune every aspect of the system, but we can still identify some problems and see how to fix them. It's also easy to vary the contents of the database to discover the effect on the synthetic speech.

    • Find and fix a labelling error

      To see, in principle, how we could improve the labels for the whole voice, we will just identify and then…

    • Vary the contents of the database

      Make some simple variations on your voice, by excluding parts of the database.

    • Introduce deliberate errors

      By deliberately varying some aspects of the system, you can discover how much effect they have on the overall quality…

    • Target cost weight

      Adjust the relative weight between the target and join cost.

    • Join sub-cost weighting

      Vary the relative weightings of the join sub-cost components (F0, power, spectrum).
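
      To make the two kinds of weight concrete, the sketch below scores one candidate unit sequence as a weighted sum of target costs plus join costs, where each join cost is itself a weighted sum of F0, power and spectrum sub-costs. This is a schematic of the general idea only: the function names, the dict keys and the exact way the weights are applied are assumptions, and Festival's own implementation may differ in detail.

```python
def join_cost(sub_costs, weights):
    """Weighted sum of the join sub-costs at one join.

    sub_costs and weights are dicts sharing the same keys, assumed here to be
    "f0", "power" and "spectrum".
    """
    return sum(weights[name] * sub_costs[name] for name in weights)


def total_cost(target_costs, join_sub_costs, target_weight, join_weights):
    """Total cost of one candidate unit sequence.

    target_costs:   one target cost per unit in the sequence
    join_sub_costs: one dict of sub-costs per join between consecutive units
    target_weight:  relative weight of the target cost against the join cost
    """
    target = sum(target_costs)
    join = sum(join_cost(j, join_weights) for j in join_sub_costs)
    return target_weight * target + join
```

      Under this formulation, raising target_weight favours units whose linguistic context matches the target, at the expense of smooth joins; setting one join weight to zero removes that sub-cost from the search entirely.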

    • Pruning

      Festival's Multisyn unit selection engine prunes the candidate lists, and performs more pruning during the search.

  • Evaluation

    The main form of evaluation should be a listening test with multiple naive listeners. But there are other ways to evaluate, and potentially to improve, your voice.

  • Writing up

    Because you kept such great notes in your logbook (didn't you?), writing up will be easy and painless.