Build the voice

The final stages of building the voice involve creating the information needed by the target and join costs, plus the representation of the speech needed for waveform generation.

We’re nearly there, and the remaining steps are almost entirely automatic.

  • Utterance structures

    The target cost in Festival is computed using linguistic information, so we need to provide that information for all the candidate units in the database. This information is stored in utterance structures.

  • Pitch tracking

    One component of the join cost is the fundamental frequency, F0. This is extracted separately from the pitch marks, even though the two are closely related.

  • Join cost coefficients

    The join cost measures potentially audible mismatch at the points where candidate units from the database are joined. To make synthesis faster at runtime, we can precompute the acoustic features that are used by the join cost.

  • Waveform representation

    Although unit selection is essentially the concatenation of pre-recorded waveform fragments, we may store those waveforms in terms of source-filter model parameters.
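
To make the idea of precomputed join cost features more concrete, here is a minimal sketch in Python. It is not Festival's implementation: the feature set (F0, energy, a short spectral vector), the equal default weights, and the names `EdgeFeatures`, `Unit` and `join_cost` are all assumptions made for illustration. The point is only that the features at each unit edge are stored once, offline, so that the runtime cost of a join is a cheap distance calculation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeFeatures:
    """Acoustic features at one edge of a candidate unit (hypothetical layout)."""
    f0: float        # fundamental frequency in Hz
    energy: float    # frame energy, arbitrary units
    spectrum: tuple  # spectral coefficients (e.g. a few cepstral values)


@dataclass(frozen=True)
class Unit:
    """A candidate unit with features precomputed at its left and right edges."""
    name: str
    left: EdgeFeatures
    right: EdgeFeatures


def euclidean(a, b):
    """Euclidean distance between two equal-length coefficient vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def join_cost(left_unit, right_unit, w_f0=1.0, w_energy=1.0, w_spec=1.0):
    """Weighted mismatch between the end of one unit and the start of the next.

    The weights are illustrative assumptions, not values taken from Festival.
    """
    a, b = left_unit.right, right_unit.left
    return (w_f0 * abs(a.f0 - b.f0)
            + w_energy * abs(a.energy - b.energy)
            + w_spec * euclidean(a.spectrum, b.spectrum))
```

In this sketch, two units whose adjoining edges have identical stored features get a join cost of zero, and the cost grows as F0, energy or spectrum diverge across the join. Because the edge features are read from the `Unit` objects rather than recomputed from waveforms, the search over candidate sequences only ever does arithmetic on a few stored numbers.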