This paper presents FastFit, a novel neural vocoder architecture that replaces the U-Net encoder with multiple short-time Fourier transforms (STFTs) to achieve faster generation rates without sacrificing sample quality. We replaced each encoder block with an STFT whose parameters match the temporal resolution of the corresponding decoder block, and fed its output to the skip connection. FastFit reduces the number of parameters and the generation time of the model by almost half while maintaining high fidelity. Through objective and subjective evaluations, we demonstrated that the proposed model achieves nearly twice the generation speed of baseline iteration-based vocoders while maintaining high sound quality. We further showed that FastFit produces sound quality similar to that of other baselines in text-to-speech evaluation scenarios, including multi-speaker and zero-shot text-to-speech.
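The sketch below illustrates this idea in PyTorch: the skip feature for each decoder block is obtained from an STFT of the input waveform whose hop size matches that block's temporal resolution, rather than from a learned encoder block. Class names, block counts, channel widths, hop sizes, and the conditioning scheme are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch (not the paper's configuration): skip connections come from
# STFTs of the input waveform at each decoder block's temporal resolution,
# replacing the learned U-Net encoder. All sizes below are assumptions.
import torch
import torch.nn as nn


class STFTSkip(nn.Module):
    """Stands in for one encoder block: an STFT at the decoder block's hop size."""

    def __init__(self, n_fft, hop_length, out_channels):
        super().__init__()
        self.n_fft, self.hop_length = n_fft, hop_length
        # Project (magnitude, phase) channels to the decoder block's width.
        self.proj = nn.Conv1d(n_fft + 2, out_channels, kernel_size=1)

    def forward(self, wav):  # wav: (B, T)
        spec = torch.stft(
            wav, self.n_fft, hop_length=self.hop_length,
            window=torch.hann_window(self.n_fft, device=wav.device),
            return_complex=True)                      # (B, n_fft//2+1, frames)
        feats = torch.cat([spec.abs(), spec.angle()], dim=1)
        return self.proj(feats)                       # (B, C, frames)


class EncoderFreeDecoder(nn.Module):
    """Upsampling decoder whose skips are STFT features instead of encoder outputs."""

    def __init__(self, mel_dim=100, base_channels=256, hops=(256, 128, 64, 32)):
        super().__init__()
        self.pre = nn.Conv1d(mel_dim, base_channels, kernel_size=7, padding=3)
        self.ups, self.skips = nn.ModuleList(), nn.ModuleList()
        ch = base_channels
        for coarse, fine in zip(hops[:-1], hops[1:]):
            self.ups.append(nn.ConvTranspose1d(ch, ch // 2, kernel_size=4,
                                               stride=coarse // fine, padding=1))
            self.skips.append(STFTSkip(n_fft=4 * fine, hop_length=fine,
                                       out_channels=ch // 2))
            ch //= 2
        self.post = nn.Conv1d(ch, 1, kernel_size=7, padding=3)

    def forward(self, mel, wav):  # mel: (B, mel_dim, frames), wav: (B, T)
        x = self.pre(mel)
        for up, skip in zip(self.ups, self.skips):
            x = up(x)
            s = skip(wav)[..., : x.shape[-1]]  # align frame counts
            x = x + s                          # STFT-derived skip connection
        # Final upsampling to the sample rate is omitted for brevity.
        return self.post(x)
```

Because the STFT branches have no trainable layers apart from the 1x1 projections, removing the encoder stack is what halves the parameter count and generation time described above.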
Comparison with baselines
We adopted LibriTTS, a multi-speaker English dataset with waveforms sampled at 24 kHz.
For training the models, the ‘train-clean-360’ subset was used.
For ground-truth mel-spectrogram evaluation (GT mel evaluation), including the ablation study, the ‘test-clean’ subset was used.
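As an illustration of the GT mel evaluation input, the snippet below extracts log-mel-spectrograms from 24 kHz ‘test-clean’ recordings with torchaudio; the FFT size, hop length, and number of mel bands are assumptions, since the exact analysis parameters are not stated here.

```python
# Hedged example: ground-truth log-mel extraction from a 24 kHz LibriTTS
# recording. n_fft, hop_length, and n_mels are illustrative assumptions.
import torchaudio


def ground_truth_mel(path, sample_rate=24000, n_fft=1024,
                     hop_length=256, n_mels=100):
    wav, sr = torchaudio.load(path)
    if sr != sample_rate:
        wav = torchaudio.functional.resample(wav, sr, sample_rate)
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=n_fft,
        hop_length=hop_length, n_mels=n_mels)(wav)
    return mel.clamp(min=1e-5).log()  # (channels, n_mels, frames)
```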
Application to text-to-speech synthesis
For multi-speaker TTS, we trained the JDI-T acoustic model using the LibriTTS ‘train-clean-360’ subset with 100 speakers.
For zero-shot TTS evaluation, we used an open-source TTS program named TorToiSe.
Recordings from 10 speakers in the LibriTTS ‘test-clean’ subset were input to the program with the ‘ultra-fast’ preset.
The vocoders were not fine-tuned on these predicted mel-spectrograms.
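A minimal sketch of this zero-shot setup follows, assuming the open-source TorToiSe Python API (TextToSpeech, load_audio, and its ‘ultra_fast’ preset); the paths and text are placeholders, and the step of extracting the predicted mel-spectrograms that are fed to each vocoder is not shown.

```python
# Hedged sketch of the zero-shot TTS setup with TorToiSe's fastest preset.
# Paths and text are placeholders; extraction of the predicted mel-spectrograms
# passed to each vocoder is not shown here.
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio

tts = TextToSpeech()
# Reference recordings of one 'test-clean' speaker (placeholder paths).
refs = [load_audio(p, 22050) for p in ("ref_1.wav", "ref_2.wav")]
wav = tts.tts_with_preset("Text to synthesize.",
                          voice_samples=refs, preset="ultra_fast")
torchaudio.save("zero_shot.wav", wav.squeeze(0).cpu(), 24000)
```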