Skip to the content.

Flow matching offers a robust and stable approach to training diffusion models. However, directly applying flow matching to neural vocoders can result in subpar audio quality. In this work, we present WaveFM, a reparameterized flow matching model for mel-spectrogram conditioned speech synthesis, designed to enhance both sample quality and generation speed for diffusion vocoders. Since mel-spectrograms represent the energy distribution of waveforms, WaveFM adopts a mel-conditioned prior distribution instead of a standard Gaussian prior to minimize unnecessary transportation costs during synthesis. Moreover, while most diffusion vocoders rely on a single loss function, we argue that incorporating auxiliary losses, including a refined multi-resolution STFT loss, can further improve audio quality. To speed up inference without degrading sample quality significantly, we introduce a tailored consistency distillation method for WaveFM. Experimental results demonstrate that our model achieves superior performance in both quality and efficiency compared to previous diffusion vocoders, while enabling waveform generation in a single inference step.

Model

Model

The total amount of parameters is 19.5M. The 128-dim time embedding is expanded to 512-dim after two linear-SiLU layers, and is then reshaped to the desired shape of each resolution. Conv1d and ConvTranspose1d are set with parameters (output channel, kernel width, dilation, padding). In ResBlock Conv1d takes same padding. Each ResLayer is defined with a kernel list and a dilation list, and their cross-product of define the shape of the ResBlock matrix and the convolutional layers of each ResBlock. On the left column are downsampling ResLayers, each containing a 4 x 1 ResBlock matrix, while on the right columns are upsampling ResLayers, each containing a 3 x 3 ResBlock matrix, following the structure from HifiGAN. In each ResBlock the number of channels is unchanged.

For detailed parameter settings please refer to WaveFM/src/params.py.

Audio Samples

All models were trained with 1M steps.

Ground
Truth
WaveFM
(6 steps)
WaveFM
(1 step)
BigVGAN-base
(1 step)
PriorGrad
(6 steps)
DiffWave
(6 steps)
HifiGAN-V1
(1 step)
FreGrad
(6 steps)
FastDiff
(6 steps)
MUSDB18-HQ Mixture 1
MUSDB18-HQ Mixture 2
MUSDB18-HQ Mixture 3
MUSDB18-HQ Mixture 4
MUSDB18-HQ Mixture 5
MUSDB18-HQ Mixture 6
MUSDB18-HQ Bass 1
MUSDB18-HQ Drum 1
MUSDB18-HQ Vocal 1
MUSDB18-HQ Vocal 2
MUSDB18-HQ Vocal 3
MUSDB18-HQ Others 1
LibriTTS Test 1
LibriTTS Test 2
LibriTTS Test 3
LibriTTS Test 4
LibriTTS Test 5
LibriTTS Test 6
LibriTTS Test 7
LibriTTS Test 8
LibriTTS Test 9
LibriTTS Test 10
LibriTTS Test 11
LibriTTS Test 12