Submitted to Interspeech 2026
General speech restoration demands techniques that can interpret complex speech structures under various distortions. While State-Space Models (SSMs) such as SEMamba have advanced the state of the art in speech denoising, they are not inherently optimized for critical speech characteristics such as spectral periodicity or multi-resolution frequency analysis. In this work, we introduce an architecture tailored to incorporate speech-specific features as inductive biases. In particular, we propose Frequency GLP, a frequency feature extraction block that effectively and efficiently leverages the properties of frequency bins. We then design a multi-resolution parallel time-frequency dual processing block and a learnable mapping to further enhance model performance. Combining these components, the proposed SEMamba++ outperforms multiple baseline models on both seen and unseen datasets. Furthermore, the model is computationally efficient, making it a viable solution in resource-constrained settings.
Figure 1: Overall architecture of SEMamba++ (a) and Frequency GLP (b). In (a), "Mag.", "Comp.", and "Decomp." refer to magnitude, compression, and decompression, respectively. "Conv" and "TrConv" denote the 2D convolution and transposed convolution that downsample and upsample along the frequency axis, respectively.
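The frequency-axis down- and upsampling performed by "Conv" and "TrConv" can be sketched with the standard convolution shape arithmetic. The kernel size, stride, and padding below are illustrative assumptions, not the paper's actual hyperparameters:

```python
# Illustrative shape arithmetic for frequency-axis down/upsampling.
# Kernel/stride/padding values are assumptions, not the paper's settings.

def conv_out_len(f, kernel=3, stride=2, pad=1):
    """Output length of a strided convolution along the frequency axis."""
    return (f + 2 * pad - kernel) // stride + 1

def trconv_out_len(f, kernel=3, stride=2, pad=1, out_pad=0):
    """Output length of the matching transposed convolution."""
    return (f - 1) * stride - 2 * pad + kernel + out_pad

F = 257                        # e.g. frequency bins of a 512-point STFT
F_down = conv_out_len(F)       # downsampled along frequency -> 129
F_up = trconv_out_len(F_down)  # upsampled back -> 257
print(F_down, F_up)            # -> 129 257
```

With these (assumed) settings, one transposed convolution exactly inverts the frequency downsampling, restoring the original number of bins.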
Under severe degradation, our method generates well-denoised, bandwidth-extended speech, but with incorrect phoneme generation.
Table 1: Objective evaluation results on the VCTK-GSR test set, the URGENT 2025 validation set, and the URGENT 2025 test set. The table lists each model's method and number of inference steps. Reg. denotes training with a regressive loss. Official checkpoints were used for models marked with *.
Table 2: Efficiency and objective performance evaluation results across degradation categories for the AATC Challenge 2025. Official checkpoints were used for models marked with ∗.
Table 3: DNSMOS results on DNS 2020 test real recordings, focusing on joint denoising and dereverberation. All models are designed for general speech restoration. Reg. denotes training with a regressive loss. Official checkpoints were used for models marked with ∗, results copied from the original papers are marked with †, and results copied from [53] are marked with ‡.
Table 4: Comparison and ablation studies on various frequency mixing modules using SEMamba++ as a backbone architecture. GP module refers to the Global Periodicity module.
Table 5: Ablation studies on the proposed multi-branch dual processing block using SEMamba++ as the backbone. ×2 and ×4 denote downsampling by factors of 2 and 4, respectively.
Table 6: Ablation studies on design choices. LMask., LMap., and Metric. refer to learnable masking, learnable mapping, and MetricGAN, respectively.
Figure 1: Gradient visualization of the outputs of different branches in the proposed multi-branch method. (a) and (b) show the magnitude spectrograms of the clean and degraded speech, respectively. Resolution-wise visualizations of the gradient-weighted magnitude spectrogram are shown in (1), (2), and (3). Resolution 1 refers to the top resolution with frequency dimension F′.
Figure 2: Ratio of gradient norms under different degradation types and intensities. R > 1 indicates a larger contribution of the Global Periodicity module than of the Local module. Degradations of additive noise (left) and bandwidth limitation (right) are shown.
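The plotted ratio can be sketched as an L2-norm ratio between the two branches' gradients. The gradient vectors below are dummy placeholders for illustration, not the model's actual gradients:

```python
import math

def grad_norm_ratio(g_global, g_local):
    """R = ||g_global||_2 / ||g_local||_2. R > 1 means the Global
    Periodicity module contributes more than the Local module."""
    n_global = math.sqrt(sum(x * x for x in g_global))
    n_local = math.sqrt(sum(x * x for x in g_local))
    return n_global / n_local

# Dummy gradient vectors for illustration only.
R = grad_norm_ratio([3.0, 4.0], [1.0, 0.0])
print(R)  # -> 5.0
```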
Figure 3: The softplus function with a learnable β for each frequency band. The black dotted line denotes the ReLU function.
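A minimal sketch of the β-parameterized softplus shown in the figure, assuming the standard parameterization (1/β)·log(1 + exp(βx)); the paper's exact per-band parameterization may differ. As β grows, the curve approaches ReLU:

```python
import math

def softplus(x, beta=1.0):
    """Softplus with a per-band sharpness beta:
    (1/beta) * log(1 + exp(beta * x)). Approaches ReLU as beta -> inf."""
    z = beta * x
    if z > 20.0:
        # Numerically stable branch: log1p(exp(z)) / beta ~= z / beta = x.
        return x
    return math.log1p(math.exp(z)) / beta

print(softplus(0.0, beta=1.0))   # -> log(2) ~= 0.6931
print(softplus(2.0, beta=50.0))  # close to ReLU(2.0) = 2.0
```

In practice each frequency band would carry its own learnable β, letting the model interpolate between a smooth gate and a hard ReLU per band.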