Enhancing Crowdsourced Audio for Text-to-Speech Models

Listen to the results of our audio enhancement pipeline applied to the Catalan split of CommonVoice (version 12). Samples are grouped by the NISQA score of the original recording, because we found that enhancement performance depends on the initial audio quality. For each sample, you can compare the original and enhanced versions.

Our enhancement pipeline consists of the following steps:

  1. Format Conversion: Converting MP3 files to WAV format using FFmpeg
  2. Audio Enhancement: Processing through the VoiceFixer enhancement model
  3. Noise Reduction:
    • Extracting the last 0.5 seconds of audio
    • Creating a noise profile from this segment
    • Applying spectral noise removal using SoX with the generated profile
  4. Audio Trimming:
    • Trimming silence (segments longer than 0.1 seconds and below -55 dB) from both ends
    • Adding 0.1 seconds of silence padding at the beginning and end
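
The steps above can be sketched roughly as follows. This is a minimal illustration, not our exact scripts: the function names, intermediate file names, and the `noisered` amount of 0.21 are hypothetical, the use of SoX's `silence`, `reverse`, and `pad` effects for step 4 is an assumption (the list above names SoX only for noise removal), and the relative `trim -0.5` form assumes a recent SoX release. The VoiceFixer call follows the `restore` method shown in that project's README, with `mode=0` chosen arbitrarily.

```python
import subprocess
from pathlib import Path


def build_commands(mp3_path, work_dir, noise_tail_s=0.5, silence_s=0.1,
                   threshold_db=-55, pad_s=0.1, noisered_amount=0.21):
    """Return the FFmpeg/SoX command lines for one clip (nothing is executed).

    All intermediate file names are hypothetical. noisered_amount is an
    assumed value; the pipeline description does not state one.
    """
    work = Path(work_dir)
    stem = Path(mp3_path).stem
    wav = work / f"{stem}.wav"
    enhanced = work / f"{stem}.enhanced.wav"   # written by VoiceFixer (step 2)
    profile = work / f"{stem}.noise.prof"
    denoised = work / f"{stem}.denoised.wav"
    final = work / f"{stem}.final.wav"

    # SoX 'silence' effect: drop leading audio until 0.1 s above the threshold.
    trim_fx = ["silence", "1", str(silence_s), f"{threshold_db}d"]

    return {
        # Step 1: MP3 -> WAV
        "convert": ["ffmpeg", "-y", "-i", str(mp3_path), str(wav)],
        # Step 3a/3b: noise profile from the last 0.5 s of the enhanced clip
        "noise_profile": ["sox", str(enhanced), "-n",
                          "trim", f"-{noise_tail_s}", "noiseprof", str(profile)],
        # Step 3c: spectral noise removal with that profile
        "denoise": ["sox", str(enhanced), str(denoised),
                    "noisered", str(profile), str(noisered_amount)],
        # Step 4: trim both ends (trim, reverse, trim, reverse), then pad
        "trim_and_pad": ["sox", str(denoised), str(final),
                         *trim_fx, "reverse", *trim_fx, "reverse",
                         "pad", str(pad_s), str(pad_s)],
    }


def run_pipeline(mp3_path, work_dir):
    """Run all four steps; requires ffmpeg, sox, and the voicefixer package."""
    from voicefixer import VoiceFixer  # imported lazily: heavy dependency

    Path(work_dir).mkdir(parents=True, exist_ok=True)
    cmds = build_commands(mp3_path, work_dir)
    subprocess.run(cmds["convert"], check=True)

    # Step 2: the enhancement model reads the WAV and writes the enhanced file.
    wav, enhanced = cmds["convert"][-1], cmds["denoise"][1]
    VoiceFixer().restore(input=wav, output=enhanced, cuda=False, mode=0)

    for step in ("noise_profile", "denoise", "trim_and_pad"):
        subprocess.run(cmds[step], check=True)
    return cmds["trim_and_pad"][2]  # path of the final file
```

Keeping command construction separate from execution makes it easy to inspect or log the exact FFmpeg and SoX invocations before running them over a full dataset.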

NISQA < 2

Original Audio

Restored Audio

NISQA < 2

Original Audio

Restored Audio

NISQA < 3

Original Audio

Restored Audio

NISQA > 3

Original Audio

Restored Audio

NISQA > 4

Original Audio

Restored Audio

NISQA > 4

Original Audio

Restored Audio