
Abstract

We propose Universal MelGAN, a vocoder that synthesizes high-fidelity speech in multiple domains. To preserve sound quality when the MelGAN-based structure is trained on a dataset of hundreds of speakers, we added multi-resolution spectrogram discriminators that sharpen the spectral resolution of the generated waveforms. This enables the model to generate realistic waveforms for multiple speakers by alleviating the over-smoothing problem in the high-frequency band of the large-footprint model. Our structure generates signals close to the ground truth without reducing inference speed, by discriminating both the waveform and the spectrogram during training. The model achieved the best mean opinion score (MOS) in most scenarios when using ground-truth mel-spectrograms as input. In particular, it showed superior performance in unseen domains with respect to speaker, emotion, and language. Moreover, in a multi-speaker text-to-speech scenario using mel-spectrograms generated by a transformer model, it synthesized high-fidelity speech with a MOS of 4.22. These results, achieved without external domain information, highlight the potential of the proposed model as a universal vocoder.
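For readers who want a concrete picture of the discriminator idea, the sketch below shows one possible PyTorch implementation of multi-resolution spectrogram discriminators: each sub-discriminator scores a magnitude spectrogram computed with a different STFT resolution. The FFT sizes, channel widths, and layer counts are assumptions for illustration, not the configuration used in the paper.

```python
# Minimal sketch of multi-resolution spectrogram discriminators (PyTorch).
# The STFT resolutions and conv layer shapes below are illustrative
# assumptions, not the paper's configuration.
import torch
import torch.nn as nn


class SpectrogramDiscriminator(nn.Module):
    """2-D convolutional discriminator over a magnitude spectrogram."""

    def __init__(self, n_fft: int, hop_length: int, win_length: int):
        super().__init__()
        self.n_fft = n_fft
        self.hop_length = hop_length
        self.win_length = win_length
        self.register_buffer("window", torch.hann_window(win_length))
        self.layers = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(3, 9), padding=(1, 4)),
            nn.LeakyReLU(0.2),
            nn.Conv2d(32, 32, kernel_size=(3, 9), stride=(1, 2), padding=(1, 4)),
            nn.LeakyReLU(0.2),
            nn.Conv2d(32, 32, kernel_size=(3, 9), stride=(1, 2), padding=(1, 4)),
            nn.LeakyReLU(0.2),
            nn.Conv2d(32, 1, kernel_size=(3, 3), padding=(1, 1)),
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) -> score map over (freq, frames)
        spec = torch.stft(
            wav,
            n_fft=self.n_fft,
            hop_length=self.hop_length,
            win_length=self.win_length,
            window=self.window,
            return_complex=True,
        ).abs()
        return self.layers(spec.unsqueeze(1))


class MultiResolutionSpectrogramDiscriminator(nn.Module):
    """Applies spectrogram discriminators at several STFT resolutions."""

    def __init__(self):
        super().__init__()
        # (n_fft, hop_length, win_length) per resolution -- assumed values.
        resolutions = [(512, 128, 512), (1024, 256, 1024), (2048, 512, 2048)]
        self.discriminators = nn.ModuleList(
            [SpectrogramDiscriminator(*r) for r in resolutions]
        )

    def forward(self, wav: torch.Tensor) -> list:
        # One score map per resolution; the adversarial loss averages over them.
        return [d(wav) for d in self.discriminators]
```

Scoring the same waveform at several spectral resolutions is what lets the discriminators penalize over-smoothed high-frequency content that a single waveform discriminator tends to miss.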

Korean samples

For Korean, each model was trained on studio-quality internal datasets with 62 speakers and 265k utterances.
‘Seen’ indicates that the domain was included in training.
‘Unseen’ indicates that the domain was not included in training.

Seen speakers

Single speaker

Index Recording Universal MelGAN FB-MelGAN WaveRNN WaveGlow
#1
#2
#3

Multiple speakers

Index Recording Universal MelGAN FB-MelGAN WaveRNN WaveGlow
#1
#2
#3

Unseen domains: speaker, emotion, language

Unseen speakers

Index Recording Universal MelGAN FB-MelGAN WaveRNN WaveGlow
#1
#2

Expressive utterances

Index Recording Universal MelGAN FB-MelGAN WaveRNN WaveGlow
Sportscasting
Anger
Disgust
Fear
Happiness
Sadness

Unseen languages

Index Recording Universal MelGAN FB-MelGAN WaveRNN WaveGlow
Spanish
German
French
Japanese
Chinese

Multi-speaker text-to-speech

To evaluate this scenario, we trained the JDI-T acoustic model with a pitch and energy predictor on a dataset of four speakers.
Each trained vocoder was fine-tuned for 100k steps on pairs of ground-truth waveforms and predicted mel-spectrograms.
Note that the predicted mel-spectrograms of JDI-T were generated from the text, the reference durations, and the ground-truth pitch and energy.
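A minimal sketch of this fine-tuning loop is given below, assuming a generator that maps mel-spectrograms to waveforms and a discriminator that returns a list of score maps. The optimizer settings, LSGAN objective, and the `spectral_l1` auxiliary loss are illustrative stand-ins, not the exact losses used in the paper; only the pairing of predicted mels with ground-truth waveforms follows the description above.

```python
# Sketch of fine-tuning a trained vocoder on (predicted mel, ground-truth
# waveform) pairs. Assumptions: `generator` maps (B, n_mels, frames) to
# (B, samples), `discriminator` returns a list of score maps, and the loss
# formulation / hyperparameters are illustrative, not the paper's.
import torch


def spectral_l1(wav_fake, wav_real, n_fft=1024, hop_length=256):
    # Magnitude-spectrogram L1 as a stand-in for the auxiliary spectral loss.
    window = torch.hann_window(n_fft, device=wav_fake.device)
    s_fake = torch.stft(wav_fake, n_fft, hop_length, window=window,
                        return_complex=True).abs()
    s_real = torch.stft(wav_real, n_fft, hop_length, window=window,
                        return_complex=True).abs()
    return (s_fake - s_real).abs().mean()


def finetune(generator, discriminator, loader, steps=100_000, device="cuda"):
    g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
    step = 0
    while step < steps:
        for mel_pred, wav_real in loader:  # predicted mel, ground-truth audio
            mel_pred, wav_real = mel_pred.to(device), wav_real.to(device)
            wav_fake = generator(mel_pred)

            # Discriminator update (LSGAN objective): real vs. generated audio.
            d_real = discriminator(wav_real)
            d_fake = discriminator(wav_fake.detach())
            d_loss = sum(((1 - r) ** 2).mean() + (f ** 2).mean()
                         for r, f in zip(d_real, d_fake))
            d_opt.zero_grad(); d_loss.backward(); d_opt.step()

            # Generator update: adversarial term plus spectral reconstruction.
            g_adv = sum(((1 - f) ** 2).mean() for f in discriminator(wav_fake))
            g_loss = g_adv + spectral_l1(wav_fake, wav_real)
            g_opt.zero_grad(); g_loss.backward(); g_opt.step()

            step += 1
            if step >= steps:
                break
```

Fine-tuning on predicted rather than ground-truth mel-spectrograms exposes the vocoder to the acoustic model's output distribution, which is what it will see at synthesis time.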

Index Universal MelGAN FB-MelGAN WaveRNN WaveGlow
#1
#2
#3
#4

English samples

For English, each model was trained on the LJSpeech and LibriTTS datasets, which together contain 905 speakers and 129k utterances.
‘Seen’ indicates that the domain was included in training.
‘Unseen’ indicates that the domain was not included in training.

Seen speakers

Single speaker

Index Recording Universal MelGAN FB-MelGAN WaveRNN WaveGlow
#1
#2
#3

Multiple speakers

Index Recording Universal MelGAN FB-MelGAN WaveRNN WaveGlow
#1
#2
#3

Unseen domains: speaker, emotion, language

Unseen speakers

Index Recording Universal MelGAN FB-MelGAN WaveRNN WaveGlow
#1
#2

Expressive utterances

Index Recording Universal MelGAN FB-MelGAN WaveRNN WaveGlow
#1
#2
#3
#4
#5
#6

Unseen languages

Index Recording Universal MelGAN FB-MelGAN WaveRNN WaveGlow
Spanish
German
French
Japanese
Chinese

Text-to-speech

We trained a single-speaker Tacotron 2 acoustic model on the LJSpeech dataset.

LJSpeech

Index Universal MelGAN
#1
#2
#3

Additional study

Korean TTS samples: Predicted mel-spectrogram of unseen speakers

Trained but not fine-tuned speakers

Index Universal MelGAN
#1
#2
#3

Not trained and not fine-tuned speakers

Index Universal MelGAN
#1
#2

Korean TTS samples: Multi-band + mixed precision

This configuration achieved a real-time factor (RTF) of 0.003 on an NVIDIA V100 GPU.
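RTF here is synthesis time divided by the duration of the generated audio. A rough way to measure it with mixed-precision inference is sketched below; the `generator` object, the hop size of 256, and the 22,050 Hz sample rate are assumptions for illustration, not the values used in this study.

```python
# Sketch of measuring the real-time factor (RTF) of a vocoder with
# mixed-precision GPU inference. Hop size and sample rate are assumptions.
import time
import torch


@torch.no_grad()
def measure_rtf(generator, mel, sample_rate=22050, hop_length=256):
    generator = generator.cuda().eval()
    mel = mel.cuda()  # (batch=1, n_mels, frames)

    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.cuda.amp.autocast():  # half-precision inference
        wav = generator(mel)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    audio_seconds = mel.shape[-1] * hop_length / sample_rate
    return elapsed / audio_seconds  # RTF < 1.0 means faster than real time
```

Synchronizing before and after the forward pass matters because CUDA kernels launch asynchronously; without it, the measured time would exclude most of the actual computation.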

Trained and fine-tuned speakers

Index Universal MelGAN
#1
#2
#3

Trained but not fine-tuned speakers

Index Universal MelGAN
#1
#2
#3

Not trained and not fine-tuned speakers

Index Universal MelGAN
#1
#2