Baidu Deep Voice explained Part 2 — Training

Dhruv Parthasarathy · Published in Athelas · Mar 11, 2017 · 8 min read

Deep Voice explained Part 2 — The Training Pipeline.

Arxiv Link: https://arxiv.org/abs/1702.07825

Institution: Baidu Research

This is the second post covering Baidu’s Deep Voice paper, which applies deep learning to text-to-speech systems.

In this post, we’ll cover how we actually train each part of this pipeline using labeled data.

Background Material

Summary of the Inference Pipeline

The last post, which covers the inference pipeline, is here for your reference.

Here’s the quick summary:

The inference pipeline for Deep Voice.
  1. Convert text into phonemes with the Grapheme-to-Phoneme model.
  • “It was early spring” -> [IH1, T, ., W, AA1, Z, ., ER1, L, IY0, ., S, P, R, IH1, NG, .]

  2. Predict the duration and fundamental frequency of each phoneme.
  • [IH1, T, ., W, AA1, Z, ., ER1, L, IY0, ., S, P, R, IH1, NG, .] -> [IH1 (140hz, 0.5s), T (142hz, 0.1s), . (Not voiced, 0.2s), W (140hz, 0.3s), …]

  3. Combine the phonemes, durations, and frequencies to output a sound wave that represents the text.
  • [IH1 (140hz, 0.5s), T (142hz, 0.1s), . (Not voiced, 0.2s), W (140hz, 0.3s), …] -> Audio

But how do we actually train the models used in each of the above steps to produce reliable predictions?

The Training Pipeline — Using existing data to train Deep Voice

The training pipeline for Deep Voice.

Deep Voice uses the training pipeline shown above to train the models used during inference.

Let’s walk through each of the pieces in this pipeline and see how they help us train the overall system. Here’s the data we use for training:

Text, along with recordings of voice actors reading that text aloud.

Step 1 — Training the Grapheme-to-Phoneme Model

The first step in inference is to convert text into phonemes using a Grapheme-to-Phoneme model. You may recall this example from the previous post:

  • Input — “It was early spring”
  • Output — [IH1, T, ., W, AA1, Z, ., ER1, L, IY0, ., S, P, R, IH1, NG, .]

Note that in many cases this system can simply look the text up in a phoneme dictionary (like this one from CMU) and return the result!

But what if it sees a new word it hasn’t seen before? This is likely to happen as we constantly add new words to our vocabulary (“google”, “screencast”, etc.). Clearly, we need a fallback that predicts phonemes when we encounter new words.

Deep Voice uses a neural network to accomplish this task. In particular, it leverages the work done by Yao and Zweig at Microsoft Research around Sequence to Sequence (Seq2Seq) learning to predict phonemes for text.

Rather than trying to explain these models in-depth myself, I’m going to point you to some of the best resources I’ve found that explain them:

Quoc Le’s presentation on Sequence to Sequence Learning from the Bay Area Deep Learning School.
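Putting the dictionary lookup and the learned fallback together, a rough sketch (an illustration, not the paper’s code) might look like this, where `g2p_model` is a hypothetical stand-in for the trained seq2seq model:

```python
# Minimal sketch of dictionary lookup with a neural fallback (hypothetical names).
PHONEME_DICT = {
    "it": ["IH1", "T"],
    "was": ["W", "AA1", "Z"],
    "early": ["ER1", "L", "IY0"],
    "spring": ["S", "P", "R", "IH1", "NG"],
    # ... in practice, loaded from a pronunciation dictionary such as CMUdict
}

def text_to_phonemes(text, g2p_model):
    phonemes = []
    for word in text.lower().split():
        if word in PHONEME_DICT:
            phonemes.extend(PHONEME_DICT[word])       # known word: dictionary lookup
        else:
            phonemes.extend(g2p_model.predict(word))  # unseen word: fall back to the model
        phonemes.append(".")                          # word-boundary marker, as in the example above
    return phonemes
```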

Shape of the Data

So what do the training data and the labels for this actually look like?

Input (X — word by word)

  • [“It”, “was”, “early”, “spring”]

Labels (Y)

  • [[IH1, T, .], [W, AA1, Z, .], [ER1, L, IY0, .], [S, P, R, IH1, NG, .]]

We obtain our input and label pairs from a standard phoneme dictionary like this one from CMU.
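For illustration, here is a rough sketch of turning a CMUdict-style file into (word, phoneme-sequence) training pairs; the file path and parsing details are assumptions, not from the paper:

```python
# Hypothetical loader for a CMUdict-style pronunciation file ("WORD  P1 P2 P3 ...").
def load_g2p_pairs(path="cmudict.dict"):
    pairs = []
    with open(path, encoding="latin-1") as f:
        for line in f:
            if line.startswith(";;;") or not line.strip():  # skip comments and blank lines
                continue
            word, *phonemes = line.split()
            pairs.append((word.lower(), phonemes))
    return pairs

# pairs might contain entries like ("spring", ["S", "P", "R", "IH1", "NG"])
```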

Step 2 — Running the Segmentation Model

As you may recall, during inference we need to predict both the duration of a phoneme and its fundamental frequency (underlying tone). We can obtain both the duration and the fundamental frequency easily from the audio clip of the phoneme.

The Segmentation model takes in the outputs of the Grapheme-to-Phoneme model and creates training data for the other models in the pipeline.

Deep Voice uses what the authors call a Segmentation model to obtain these audio clips for each phoneme.

The Segmentation model matches up each phoneme with the relevant segment of audio where that phoneme is spoken. You can see a high level overview of its inputs and outputs in the figure below:

The Segmentation Model predicts where a phoneme will occur in a given audio clip.

Shape of the Data

What’s particularly interesting about the implementation of the Segmentation model is that instead of predicting the position of individual phonemes, the model actually predicts the position of pairs of phonemes. Additionally, this model is trained without ground truth alignments, since we don’t have labels for where each phoneme occurs in the audio clip. Instead, it is trained with CTC (Connectionist Temporal Classification) loss, which you can read about here.
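For a concrete sense of what training with CTC loss means, here is a toy PyTorch sketch (the shapes and class counts are made up, and this is not the paper’s implementation): a network emits per-frame scores over phoneme-pair labels, and CTC aligns them to the target pair sequence without any frame-level labels.

```python
import torch
import torch.nn as nn

# Toy example of CTC training on per-frame scores over phoneme-pair classes.
T, N, C = 200, 1, 50                      # audio frames, batch size, phoneme-pair classes (incl. blank)
frame_scores = torch.randn(T, N, C, requires_grad=True)  # stand-in for the segmentation net's output
log_probs = frame_scores.log_softmax(dim=-1)

targets = torch.tensor([[3, 7, 12, 9]])   # indices of the target phoneme pairs for this clip
input_lengths = torch.tensor([T])
target_lengths = torch.tensor([4])

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                           # gradients flow back to the (stand-in) network output
```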

Here’s what the data for this model looks like:

Input (X)

  • Audio clip of “It was early spring”
  • Phonemes
  • [IH1, T, ., W, AA1, Z, ., ER1, L, IY0, ., S, P, R, IH1, NG, .]

Outputs (Y)

  • Pairs of Phonemes with their start time
  • [(IH1, T, 0:00), (T, ., 0:01), (., W, 0:02), (W, AA1, 0:025), (NG, ., 0:035)]

Why predict the position of pairs rather than individual phonemes? When we predict the probability that a given timestamp corresponds to a single phoneme, that probability peaks in the middle of the phoneme’s utterance, not at its edges.

For pairs of phonemes, on the other hand, the probability peaks exactly at the boundary between the two phonemes. Using pairs therefore allows us to easily find the boundaries between phonemes.

At the end of this process, we should have a clear idea of where each phoneme occurs in the audio clip.
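To make this concrete, here is a toy example (with made-up timestamps, not values from the paper) of turning consecutive pair boundaries into per-phoneme durations; the phoneme shared by two consecutive pairs lives between their two boundary times:

```python
# Each tuple is (first phoneme, second phoneme, boundary time in seconds) - hypothetical values.
pair_boundaries = [("IH1", "T", 0.00), ("T", ".", 0.10), (".", "W", 0.20), ("W", "AA1", 0.25)]

durations = []
for (_a, shared, start), (_b, _c, end) in zip(pair_boundaries, pair_boundaries[1:]):
    # "shared" is the phoneme sitting between two consecutive boundaries.
    durations.append((shared, round(end - start, 3)))

print(durations)   # [('T', 0.1), ('.', 0.1), ('W', 0.05)]
```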

Step 3 — Training Duration and Fundamental Frequency Prediction

In the Inference step, we need to predict durations and fundamental frequencies for a given phoneme.

Now that we have the durations and the fundamental frequencies from the Segmentation model, we can train models to predict both of these quantities for new phonemes.

The outputs of the Segmentation model are labels for the duration and fundamental frequency models.

Deep Voice uses a single, jointly trained model that outputs both of these values. Here’s what the shape of the data will look like for training this model.

Shape of the Data

Input (X)

  • Phonemes.
  • [IH1, T, ., W, AA1, Z, ., ER1, L, IY0, ., S, P, R, IH1, NG, .]

Labels (Y)

  • Durations and Fundamental Frequencies of each phoneme. We get this from the segmentation model.
  • [(IH, 0.05s, 140 hz), (T, 0.07s, 141 hz), … ]

And with that, we’ll be able to carry out duration prediction and F0 prediction!
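As a rough sketch (an assumption on my part, not the paper’s exact architecture), a jointly trained duration and F0 model could look like the following in PyTorch, with an embedding layer, a recurrent layer, and one regression head per quantity:

```python
import torch
import torch.nn as nn

# Rough sketch: embed phonemes, run a recurrent layer over the sequence, and
# predict a duration (seconds) and a fundamental frequency (Hz) per phoneme.
class DurationF0Model(nn.Module):
    def __init__(self, num_phonemes, emb_dim=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(num_phonemes, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.duration_head = nn.Linear(2 * hidden, 1)   # regression head for duration
        self.f0_head = nn.Linear(2 * hidden, 1)         # regression head for F0

    def forward(self, phoneme_ids):                     # phoneme_ids: (batch, seq_len)
        h, _ = self.rnn(self.embed(phoneme_ids))
        return self.duration_head(h).squeeze(-1), self.f0_head(h).squeeze(-1)

model = DurationF0Model(num_phonemes=70)
ids = torch.randint(0, 70, (1, 17))      # e.g. the 17 phonemes of "It was early spring"
pred_duration, pred_f0 = model(ids)      # trained against the segmentation model's labels
                                         # with a regression loss such as mean squared error
```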

Step 4 — Training Audio Synthesis

Finally, we need to train the piece of our pipeline that actually generates human-sounding audio. Much like DeepMind’s WaveNet, this model synthesizes the raw audio waveform directly, conditioned on the phonemes along with their durations and fundamental frequencies.
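As a toy illustration of the dilated-causal-convolution idea underlying WaveNet-style synthesis (the channel sizes and depth here are made up, and the conditioning on phonemes, durations, and F0 is omitted; this is not the paper’s architecture):

```python
import torch
import torch.nn as nn

# Toy stack of dilated causal 1-D convolutions with residual connections.
class TinyWaveNetBlock(nn.Module):
    def __init__(self, channels, dilation):
        super().__init__()
        self.left_pad = dilation              # pad only on the left so the conv stays causal
        self.conv = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)

    def forward(self, x):                     # x: (batch, channels, samples)
        out = self.conv(nn.functional.pad(x, (self.left_pad, 0)))
        return x + torch.tanh(out)            # residual connection keeps the sequence length

stack = nn.Sequential(*[TinyWaveNetBlock(32, 2 ** i) for i in range(6)])
features = torch.randn(1, 32, 16000)          # e.g. one second of conditioning features at 16 kHz
out = stack(features)                         # same length; receptive field grows with each block
```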

For training, we’ll use the ground truth audio clips as labels for our data.

And here’s what the inputs and labels will look like to the model.

Shape of the Data

Input (X)

  • Phonemes with their durations and fundamental frequencies.
  • [(HH, 0.05s, 140 hz), (EH, 0.07s, 141 hz), … ]

Labels (Y)

  • Ground truth audio clips for the given text.

With this, we’ve trained all pieces of our pipeline and can successfully run inference.

Summary

Congratulations on making it this far! By now, you’ve seen both how Deep Voice generates new audio and how it is trained in the first place. To summarize, here are the steps to training Deep Voice:

  1. Train the Grapheme-to-Phoneme model.

Input (X)

  • “It was early spring”

Label (Y)

  • [IH1, T, ., W, AA1, Z, ., ER1, L, IY0, ., S, P, R, IH1, NG, .]

2. Run the segmentation model.

Input (X)

  • Audio Wave of “It was early spring”
  • Phonemes
  • [IH1, T, ., W, AA1, Z, ., ER1, L, IY0, ., S, P, R, IH1, NG, .]

Output (Y)

  • Pairs of phonemes with their start time in the audio.
  • [(IH1, T, 0:00), (T, ., 0:01), (., W, 0:02), (W, AA1, 0:025), (NG, ., 0:035)]

3. Train the Duration and F0 Prediction models.

Input (X)

  • Phonemes.
  • [IH1, T, ., W, AA1, Z, ., ER1, L, IY0, ., S, P, R, IH1, NG, .]

Labels (Y)

  • [(IH, 0.05s, 140 hz), (T, 0.07s, 141 hz), … ]

4. Train the Audio Synthesis model.

Input (X)

  • [(HH, 0.05s, 140 hz), (EH, 0.07s, 141 hz), … ]

Labels (Y)

  • Ground truth audio clips for the given text.

And that’s it! Thank you for taking the time to read through this pair of posts on Baidu’s Deep Voice. If you have any suggestions on how I can make it better, feel free to drop a comment and I’ll do my best to improve.

For the next paper, I’ll aim to cover one of the many recent papers applying Convolutional Neural Networks to problems in medical imaging. Keep an eye out for it and see you then!

