Recent advances in language models (LMs) have demonstrated strong capabilities in semantic understanding and contextual modeling, spurring rapid progress in generative speech enhancement (SE). However, many LM-based SE approaches focus primarily on semantic information and often neglect the critical role of acoustic information, which leads to acoustic inconsistency after enhancement and limited generalization across diverse SE tasks. In this paper, we introduce LLaSE-G1, a LLaMA-based language model that incentivizes generalization capability for speech enhancement. LLaSE-G1 offers the following key contributions: First, to mitigate acoustic inconsistency, LLaSE-G1 takes continuous representations from WavLM as input and predicts speech tokens from X-Codec2, maximizing acoustic preservation. Second, to promote generalization, LLaSE-G1 introduces dual-channel inputs and outputs, unifying multiple SE tasks without requiring task-specific IDs. Third, LLaSE-G1 outperforms prior task-specific discriminative and generative SE models, demonstrating scaling effects at test time and emergent capabilities on unseen SE tasks. Additionally, we release our code and models to support further research in this area.
Figure 1: Overview of the LLaSE-G1 framework. LLaSE-G1 greatly simplifies the model structure, retaining three main components: (1) a WavLM encoder, (2) a LLaMA-based LM, and (3) an X-Codec2 decoder.
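To make the data flow concrete, the sketch below shows how the three components might be wired together in PyTorch. It is a minimal, single-channel illustration based only on the description above: the projection layer, the LM head, the LLaMA configuration values, and the `codec_decoder` callable (standing in for the X-Codec2 token-to-waveform decoder) are assumptions, not the released LLaSE-G1 implementation, and the dual-channel input/output handling is omitted.

```python
# Minimal single-channel sketch of the pipeline: WavLM features -> LLaMA-style LM
# -> X-Codec2 token IDs -> waveform. Layer names and sizes are illustrative only.
import torch
import torch.nn as nn
from transformers import WavLMModel, LlamaConfig, LlamaModel


class LLaSEPipelineSketch(nn.Module):
    def __init__(self, codec_decoder, codec_vocab_size=65536, lm_dim=1024):
        super().__init__()
        # (1) WavLM encoder: continuous acoustic representations of the degraded input.
        self.encoder = WavLMModel.from_pretrained("microsoft/wavlm-large")
        # Assumed projection from WavLM feature space into the LM embedding space.
        self.in_proj = nn.Linear(self.encoder.config.hidden_size, lm_dim)
        # (2) LLaMA-based LM over the projected feature sequence (config values are guesses).
        self.lm = LlamaModel(LlamaConfig(hidden_size=lm_dim,
                                         intermediate_size=4 * lm_dim,
                                         num_hidden_layers=16,
                                         num_attention_heads=16,
                                         vocab_size=codec_vocab_size))
        # Assumed classification head over the X-Codec2 token vocabulary.
        self.head = nn.Linear(lm_dim, codec_vocab_size)
        # (3) X-Codec2 decoder, injected as a callable: token IDs -> enhanced waveform.
        self.codec_decoder = codec_decoder

    @torch.no_grad()
    def forward(self, degraded_wav):                 # (batch, samples) at 16 kHz
        feats = self.encoder(degraded_wav).last_hidden_state
        hidden = self.lm(inputs_embeds=self.in_proj(feats)).last_hidden_state
        tokens = self.head(hidden).argmax(dim=-1)    # predicted speech tokens
        return self.codec_decoder(tokens)            # enhanced waveform
```

In the full model, the dual-channel design presumably lets a second input channel carry the enrollment or reference signal (as in the target speaker extraction and echo cancellation demos below) and a second output channel carry an additional separated speaker, which is how the tasks are unified without task-specific IDs.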
Noise suppression aims to remove unwanted background noise from a speech signal, enhancing the clarity of the target speech. Below are demos from the Interspeech 2020 DNS Challenge Blind Testset. Each line contains four audio files: the original clean WAV, the noisy WAV, the WAV processed by SELM, and the WAV processed by our system.
[10 demo rows, each with four audio players: Clean, Noisy, SELM, LLaSE-G1]
Target speaker extraction involves isolating the speech of a specific speaker from a mixture of multiple speakers. Below are demos from the 2023 DNS Blind Testset. Each line contains three audio files: the original noisy WAV, the enrollment WAV, and the WAV processed by our system.
[7 demo rows, each with three audio players: Noisy, Enrollment, LLaSE-G1]
Packet loss concealment addresses missing audio frames in speech communication, using a model to predict and fill in the lost content. Below are demos from the 2024 PLC Blind Testset. Each line contains two audio files: the original lossy WAV and the WAV processed by our system. Notably, our system does not require an additional lossy-frame mask label.
[5 demo rows, each with two audio players: Lossy, LLaSE-G1]
Acoustic echo cancellation aims to eliminate echo from the captured audio signal, given a far-end reference signal. Below are demos from the 2023 AEC Blind Testset. Each line contains three audio files: the original noisy WAV, the reference WAV, and the WAV processed by our system.
[7 demo rows, each with three audio players: Noisy, Reference, LLaSE-G1]
Speech separation involves isolating individual speech signals from a mixture of multiple speakers. We treat speech separation as an unseen task during training. Below are demos from the LibriMix and WSJ0-2mix test sets. Each line contains three audio files: the original mixed WAV, and the first and second speakers' WAVs separated by our system.
[7 demo rows, each with three audio players: Mix, Speaker 1, Speaker 2]