
FastFit: Towards Real-Time Iterative Neural Vocoder by Replacing U-Net Encoder With Multiple STFTs

Anonymous submission to INTERSPEECH 2023


Abstract

This paper presents FastFit, a novel neural vocoder architecture that replaces the U-Net encoder with multiple short-time Fourier transforms (STFTs) to achieve faster generation rates without sacrificing sample quality. We replace each encoder block with an STFT whose parameters match the temporal resolution of the corresponding decoder block, and feed the result to the decoder through the skip connection. FastFit reduces the model's parameter count and generation time by almost half while maintaining high fidelity. Through objective and subjective evaluations, we demonstrate that the proposed model achieves nearly twice the generation speed of baseline iteration-based vocoders while maintaining high sound quality. We further show that FastFit produces sound quality similar to that of other baselines in text-to-speech evaluation scenarios, including multi-speaker and zero-shot text-to-speech.
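As a rough illustration of the idea in the abstract, the sketch below computes one STFT per decoder block, with the hop size set to that block's temporal resolution, so the resulting features can be passed through the skip connections in place of a learned encoder's outputs. This is a minimal PyTorch sketch, not the paper's implementation: the hop sizes in HOP_SIZES, the 4x FFT-size factor, and the real/imaginary channel stacking are illustrative assumptions.

```python
import torch

# Hypothetical hop sizes, one per decoder block, matching each block's
# temporal resolution (illustrative values, not taken from the paper).
HOP_SIZES = [256, 128, 64, 32]

def multi_stft_skips(waveform: torch.Tensor) -> list[torch.Tensor]:
    """Compute one STFT per decoder block to serve as skip connections.

    Each STFT uses a hop size equal to the temporal resolution of the
    decoder block it feeds, so no learned encoder is needed.
    """
    skips = []
    for hop in HOP_SIZES:
        n_fft = hop * 4  # assumed window/FFT size choice
        spec = torch.stft(
            waveform,
            n_fft=n_fft,
            hop_length=hop,
            window=torch.hann_window(n_fft),
            return_complex=True,
        )
        # Stack real and imaginary parts as channels: (batch, 2 * freq, frames)
        skips.append(torch.cat([spec.real, spec.imag], dim=1))
    return skips

# Example: skip features for a batch of 1-second, 24 kHz waveforms.
x = torch.randn(2, 24000)
for s in multi_stft_skips(x):
    print(s.shape)
```

Because the STFTs have no learnable parameters, this construction removes the entire encoder half of the U-Net, which is consistent with the roughly halved parameter count and generation time reported in the abstract.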

Comparison with baselines

Application to text-to-speech synthesis

Multi-speaker text-to-speech

[Audio samples #1–#4 for each model: UnivNet, FastDiff, WaveFit, FastFit, FastFit (U-Net)]

Zero-shot text-to-speech

[Audio samples #1–#4 for each model: UnivNet, FastDiff, WaveFit, FastFit, FastFit (U-Net)]

Ground truth mel-spectrogram

[Audio samples #1–#4 for each model: Recordings, UnivNet, FastDiff, WaveFit, FastFit, FastFit (U-Net)]

Ablation studies

[Audio samples #1–#4 for each system: Recordings, FastFit, FastFit (U-Net), Without AdaLN, Without skip-connections, Initial point from spectral envelope, Initial point from Griffin-Lim, Magnitude STFT encoder]