Submitted to Interspeech 2026
General speech restoration demands techniques that can interpret complex speech structures under various distortions. While State-Space Models (SSMs) such as SEMamba have advanced the state of the art in speech denoising, they are not inherently optimized for critical speech characteristics such as spectral periodicity or multi-resolution frequency analysis. In this work, we introduce an architecture tailored to incorporate speech-specific features as inductive biases. In particular, we propose Frequency GLP, a frequency feature extraction block that effectively and efficiently leverages the properties of frequency bins. We then design a multi-resolution parallel time-frequency dual processing block and a learnable mapping to further enhance model performance. Combining these components, the proposed SEMamba++ outperforms multiple baseline models on both seen and unseen datasets. Furthermore, the model is computationally efficient, making it a viable solution in resource-constrained settings.
Figure 1: Overall architecture of SEMamba++ (a) and Frequency GLP (b). In (a), "Mag.", "Comp.", and "Decomp." refer to magnitude, compression, and decompression, respectively. "Conv" and "TrConv" denote the 2D convolution and transposed convolution that downsample and upsample along the frequency axis, respectively.
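The frequency-axis down- and upsampling performed by "Conv" and "TrConv" can be sketched with the standard convolution shape arithmetic. The kernel size, stride, and padding below are illustrative assumptions, not the paper's actual hyperparameters:

```python
# Illustrative shape arithmetic for frequency-axis down/upsampling.
# Kernel/stride/padding values are assumptions, not the paper's settings.

def conv_out_len(f, kernel=3, stride=2, pad=1):
    """Output length of a strided convolution along the frequency axis."""
    return (f + 2 * pad - kernel) // stride + 1

def trconv_out_len(f, kernel=3, stride=2, pad=1, out_pad=0):
    """Output length of the matching transposed convolution."""
    return (f - 1) * stride - 2 * pad + kernel + out_pad

F = 257                        # e.g. frequency bins of a 512-point STFT
F_down = conv_out_len(F)       # downsampled along frequency -> 129
F_up = trconv_out_len(F_down)  # upsampled back -> 257
print(F_down, F_up)            # -> 129 257
```

With these (assumed) settings, one transposed convolution exactly inverts the frequency downsampling, restoring the original number of bins.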
Under severe degradation, our method generates well-denoised, bandwidth-extended speech, but with incorrect phoneme generation.
Table 1: Objective evaluation results on the VCTK-GSR test set, the URGENT 2025 validation set, and the URGENT 2025 test set. The table lists each model's method and number of inference steps. Reg. denotes training with a regressive loss. Official checkpoints were used for models marked with *.
Table 2: Efficiency and objective performance evaluation results across degradation categories for the AATC Challenge 2025. Official checkpoints were used for models marked with ∗.
Table 3: DNSMOS results on DNS 2020 test real recordings, focusing on joint denoising and dereverberation. All models are designed for general speech restoration. Reg. denotes training with a regressive loss. Official checkpoints were used for models marked with ∗, results copied from the original papers are marked with †, and results copied from [53] are marked with ‡.
Table 4: Comparison and ablation studies on various frequency mixing modules using SEMamba++ as a backbone architecture. GP module refers to the Global Periodicity module.
Table 5: Ablation studies on the proposed multi-branch dual processing block using SEMamba++ as the backbone. ×2 and ×4 denote downsampling by factors of 2 and 4, respectively.
Table 6: Ablation studies on design choices. LMask., LMap., and Metric. refer to learnable masking, learnable mapping, and MetricGAN, respectively.
Figure 1: Gradient visualization of the outputs of different branches in the proposed multi-branch method. (a) and (b) show the magnitude spectrograms of the clean and degraded speech, respectively. Resolution-wise visualizations of the gradient-weighted magnitude spectrogram are shown in (1), (2), and (3). Resolution 1 refers to the top resolution with frequency dimension F′.
Figure 2: Ratio of gradient norms under different degradation types and intensities. R > 1 indicates a larger contribution of the Global Periodicity module than of the Local module. Degradations of additive noise (left) and bandwidth limitation (right) are shown.
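The plotted ratio can be sketched as an L2-norm ratio between the two branches' gradients. The gradient vectors below are dummy placeholders for illustration, not the model's actual gradients:

```python
import math

def grad_norm_ratio(g_global, g_local):
    """R = ||g_global||_2 / ||g_local||_2. R > 1 means the Global
    Periodicity module contributes more than the Local module."""
    n_global = math.sqrt(sum(x * x for x in g_global))
    n_local = math.sqrt(sum(x * x for x in g_local))
    return n_global / n_local

# Dummy gradient vectors for illustration only.
R = grad_norm_ratio([3.0, 4.0], [1.0, 0.0])
print(R)  # -> 5.0
```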
Figure 3: The softplus function with a learnable β for each frequency band. The black dotted line denotes the ReLU function.
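A minimal sketch of the β-parameterized softplus shown in the figure, assuming the standard parameterization (1/β)·log(1 + exp(βx)); the paper's exact per-band parameterization may differ. As β grows, the curve approaches ReLU:

```python
import math

def softplus(x, beta=1.0):
    """Softplus with a per-band sharpness beta:
    (1/beta) * log(1 + exp(beta * x)). Approaches ReLU as beta -> inf."""
    z = beta * x
    if z > 20.0:
        # Numerically stable branch: log1p(exp(z)) / beta ~= z / beta = x.
        return x
    return math.log1p(math.exp(z)) / beta

print(softplus(0.0, beta=1.0))   # -> log(2) ~= 0.6931
print(softplus(2.0, beta=50.0))  # close to ReLU(2.0) = 2.0
```

In practice each frequency band would carry its own learnable β, letting the model interpolate between a smooth gate and a hard ReLU per band.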