Spectral Codecs: Improving Non-Autoregressive Speech Synthesis with Spectrogram-Based Audio Codecs

Model Checkpoints

We release optimized checkpoints for our codecs.

Audio FSQ 22kHz
Audio FSQ 44kHz

Spectral FSQ 22kHz
Spectral FSQ 44kHz

Codec Reconstruction

We compare audio reconstructed by different codec models after compression, with samples in English, Spanish, and French.

English

Ground Truth


EnCodec (24 kHz)

Descript Audio Codec


Audio RVQ codec

Audio FSQ codec

HiFi-GAN

Spectral RVQ codec

Spectral FSQ codec


Spanish

Ground Truth


EnCodec (24 kHz)

Descript Audio Codec


Audio RVQ codec

Audio FSQ codec

HiFi-GAN

Spectral RVQ codec

Spectral FSQ codec


French

Ground Truth


EnCodec (24 kHz)

Descript Audio Codec


Audio RVQ codec

Audio FSQ codec

HiFi-GAN

Spectral RVQ codec

Spectral FSQ codec


TTS Synthesis

We compare speech synthesized by FastPitch when trained with different codec models. For our best-performing codecs, we also provide samples from an autoregressive FastPitch model.

Example 1

Ground Truth


EnCodec (24 kHz)

Descript Audio Codec


Audio RVQ codec

Audio FSQ codec

HiFi-GAN

Spectral RVQ codec

Spectral FSQ codec


Audio RVQ codec (autoregressive)

Audio FSQ codec (autoregressive)

Spectral FSQ codec (autoregressive)


Example 2

Ground Truth


EnCodec (24 kHz)

Descript Audio Codec


Audio RVQ codec

Audio FSQ codec

HiFi-GAN

Spectral RVQ codec

Spectral FSQ codec


Audio RVQ codec (autoregressive)

Audio FSQ codec (autoregressive)

Spectral FSQ codec (autoregressive)