Speech is a highly efficient carrier of information and plays an important role in communication and human-computer interaction. Speech enhancement aims to recover clean speech from received mixtures contaminated by noise and reverberation, thereby improving speech quality and intelligibility. Deep neural networks (DNNs), with their powerful nonlinear modeling ability, have enabled speech enhancement algorithms to achieve significant performance gains in scenarios with non-stationary noise and low signal-to-noise ratios (SNRs). Despite this success, most existing DNN-based approaches employ a single-stage scheme to estimate the clean speech spectrum, which leads to mutual interference between estimation targets, as observed in two specific cases. 1) Wideband speech enhancement methods typically use a single-stage network to estimate the real and imaginary components of the clean complex spectrum, thus implicitly recovering the spectral magnitude and phase of clean speech. However, simultaneously optimizing the real and imaginary components often suffers from compensation effects between spectral magnitude and phase, resulting in performance degradation at low SNRs. 2) Existing full-band speech enhancement algorithms typically estimate the full-band spectrum in a single-stage manner, ignoring the differences in energy, pitch, and harmonics between the low- and high-frequency bands of speech. This can cause mutual influence between frequency bands during estimation and limit performance. To address these limitations of single-stage methods, the basic idea of this dissertation is to decompose the speech enhancement target; multi-stage algorithms based on target decomposition are proposed for both the wideband and full-band speech enhancement tasks. The key questions lie in how to decompose the target and how to exploit the decomposed sub-targets for joint optimization. The main contributions are summarized as follows.

1) A single-channel dual-branch federative magnitude and phase estimation speech enhancement method is proposed. Since phase is highly unstructured and difficult to estimate directly, spectral estimation is decomposed into magnitude spectrum estimation and residual complex spectrum estimation, and two core branches are elaborately designed to recover the overall spectrum. The magnitude estimation branch (MEB) constructs a filtering system to coarsely suppress the dominant noise components in the magnitude domain while keeping the original phase unaltered. The complex spectrum purification branch (CPB) estimates the residual real and imaginary components of the complex spectrum and repairs the fine-grained spectral structures that may be lost along the MEB path, while the phase information is implicitly recovered. Within each branch, a novel Attention-in-Attention Transformer-based network is proposed to funnel the global sequence modeling process and capture long-range dependencies during feature learning. Information interaction modules are employed between the branches to exchange information and facilitate overall spectrum recovery from complementary perspectives. Experimental results on two public datasets demonstrate that the proposed method significantly improves speech quality and intelligibility in terms of PESQ, ESTOI, and SDR.

2) An unsupervised speech enhancement method based on the joint optimization of spectral magnitude and phase is proposed. Given the lack of adequate paired noisy-clean speech corpora in many real scenarios, it is intractable to conduct speech enhancement with traditional supervised methods. In this dissertation, the spectral estimation is decomposed and a novel Cycle-in-Cycle Generative Adversarial Network (CinCGAN) is proposed for non-parallel speech enhancement, which estimates the complex spectrum of clean speech step by step in a multi-stage training manner. In the first stage, because simultaneously estimating spectral magnitude and phase is difficult under non-parallel training, a magnitude spectrum estimation CycleGAN (MCGAN) is pretrained to estimate only the spectral magnitude of clean speech, ignoring phase recovery. In the second stage, the pretrained MCGAN is incorporated into a complex spectrum refinement CycleGAN (CCGAN), forming the Cycle-in-Cycle GAN that refines the overall complex spectrum and implicitly recovers the clean phase. Experimental results show that the method significantly outperforms previous baselines under both standard supervised and unsupervised training, especially in reducing background noise (CBAK) and speech distortion (COVL).

3) A coordinated sub-band decomposition-and-fusion method is proposed for full-band speech enhancement. Since the frequency range of full-band speech is wide and the spectral characteristics of each band differ greatly, the full-band spectrum is split into a low band (0-8 kHz), a middle band (8-16 kHz), and a high band (16-24 kHz), and three sub-networks are meticulously devised to handle them in a step-wise manner. First, as speech contains more harmonics and semantic information in the 0-8 kHz range, a collaborative dual-stream network is proposed to eliminate noise and recover the clean complex spectrum in the low-band region. Then, based on the fact that the 8-24 kHz bands tend to contain less speech information, two cascaded lightweight magnitude-masking sub-networks are employed to suppress middle- and high-band noise in the magnitude domain. To strengthen information interaction, a sub-band interaction module is proposed to guide mask estimation in the middle- and high-band networks. Finally, the estimated low-, middle-, and high-band spectra are fused to obtain the clean full-band signal. Comprehensive experiments on two public benchmarks validate that, in terms of objective metrics such as perceptual speech quality (PESQ), speech intelligibility (STOI), and segmental signal-to-noise ratio (SSNR), the proposed method consistently outperforms single-stage full-band speech enhancement algorithms, especially in reducing speech distortion and improving overall quality.

In summary, following the idea of target decomposition, this dissertation proposes multi-stage speech enhancement algorithms based on target decomposition. Experimental results show that the proposed methods achieve more competitive performance on several public datasets than existing wideband and full-band speech enhancement algorithms based on single-stage target estimation.
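The dual-branch recovery in contribution 1 can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: the MEB is stood in for by a real-valued magnitude mask and the CPB by a residual complex spectrum, and both inputs here are hypothetical placeholders for what the two network branches would produce.

```python
# Sketch of the dual-branch spectrum recovery (contribution 1).
# `meb_mask` and `cpb_residual` are hypothetical stand-ins for the outputs
# of the magnitude estimation branch and complex spectrum purification branch.
import numpy as np

def recover_spectrum(noisy_spec, meb_mask, cpb_residual):
    """Combine the two branches: coarse magnitude filtering with the noisy
    phase kept unaltered, plus a fine-grained complex residual correction
    that implicitly refines the phase."""
    coarse = meb_mask * np.abs(noisy_spec) * np.exp(1j * np.angle(noisy_spec))
    return coarse + cpb_residual

# Toy check: an all-ones mask and zero residual leave the spectrum unchanged.
y = np.array([[1.0 + 1.0j, -2.0 + 0.5j]])
out = recover_spectrum(y, np.ones(y.shape), np.zeros(y.shape))
assert np.allclose(out, y)
```

The point of the decomposition is visible in the formula: the MEB term only rescales magnitudes, so phase errors are corrected entirely by the additive residual from the CPB.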
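The sub-band decomposition-and-fusion scheme of contribution 3 can likewise be sketched. The snippet below only shows the band split at the stated 8/16 kHz boundaries and the final fusion; the frame size is an illustrative assumption, and identity pass-throughs stand in for the three sub-networks described in the dissertation.

```python
# Sketch of sub-band split and fusion for full-band (48 kHz) enhancement.
# N_FFT is an illustrative choice; the three band "enhancers" are omitted
# and replaced by identity pass-throughs.
import numpy as np

SR = 48000                                  # full-band sampling rate
N_FFT = 960                                 # illustrative frame size -> 481 bins
freqs = np.fft.rfftfreq(N_FFT, d=1.0 / SR)  # bin center frequencies, 0..24 kHz

def split_bands(spec):
    """Split a complex STFT (bins x frames) into low/mid/high sub-bands."""
    low = spec[freqs < 8000]                            # 0-8 kHz
    mid = spec[(freqs >= 8000) & (freqs < 16000)]       # 8-16 kHz
    high = spec[freqs >= 16000]                         # 16-24 kHz
    return low, mid, high

def fuse_bands(low, mid, high):
    """Stack the enhanced sub-band spectra back into a full-band spectrum."""
    return np.concatenate([low, mid, high], axis=0)

# Toy round trip: with identity "enhancers", fusion reproduces the input.
rng = np.random.default_rng(0)
noisy = rng.standard_normal((len(freqs), 10)) + 1j * rng.standard_normal((len(freqs), 10))
fused = fuse_bands(*split_bands(noisy))
assert fused.shape == noisy.shape and np.allclose(fused, noisy)
```

In the actual method, the low band would pass through the complex-spectrum dual-stream network and the middle/high bands through the cascaded magnitude-masking sub-networks before fusion; the split-and-concatenate scaffolding stays the same.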