| Speech is a primary means of conveying information in everyday life, and with the rise of artificial intelligence, the requirements for speech quality in human-computer interaction have become more stringent. In real-life environments, received speech is often accompanied by complex noise, which seriously degrades the transmission of speech signals. The aim of speech enhancement is to separate the noise from the target speech and improve the intelligibility and quality of the speech signal. Traditional speech enhancement methods are usually effective for stationary noise but are less effective for the non-stationary noise found in real-world environments. Speech enhancement methods based on deep learning have improved speech intelligibility and algorithmic robustness. However, the following problems remain: common recurrent neural networks cannot be parallelised, because each step's computation depends on the output of the previous step; the local and long-term dependencies of speech sequences are not both captured, and the position information of the speech signal sequence is not modelled; and using only the magnitude spectrum as the input feature leaves the phase mismatched in the enhanced spectrum, which limits the achievable enhancement performance. To address these problems, this paper proposes the following methods. (1) An encoder-decoder structure avoids the phase mismatch caused by reusing the noisy phase; the Transformer enables parallel sequence computation and mitigates the vanishing and exploding gradient problem during training; four stacked dual-path Transformer modules allow the model to process speech sequences thousands of steps long, which the conventional Transformer cannot handle; and finally, a local recurrent neural network structure replaces the extra positional embedding in the Transformer model, capturing the sequential information of the local positions of the
input sequences and efficiently extracting both local and global information. Experimental results on the Voice Bank corpus and the DEMAND dataset show that the proposed model outperforms most existing time-domain and time-frequency-domain models. (2) A speech enhancement method based on a dual-path Transformer model with progressive learning for low signal-to-noise ratios is proposed. First, a lightweight neural network with three different sub-networks is constructed; each sub-network outputs a different target speech, so that training proceeds from simple tasks to complex ones, which in turn improves the generalisation of the model. Experimental results on the TIMIT and NOISEX-92 datasets show that, at low signal-to-noise ratios, the proposed model achieves higher evaluation metric scores than other speech enhancement models while using fewer parameters. |
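
The dual-path processing mentioned in (1) can be illustrated with a minimal sketch. It assumes, as is common in dual-path designs, that the sequence is split into 50%-overlapping chunks so that attention is applied first within each chunk (local) and then across chunks (global). The chunk size, the hop, and the function names `num_chunks`, `segment`, and `attention_pairs` are hypothetical and are not taken from this work; the sketch shows only the segmentation and a rough count of attention pairs, not the proposed model.

```python
import numpy as np

def num_chunks(length, chunk_size):
    """Number of 50%-overlapping chunks needed to cover `length` steps.

    Assumes chunk_size >= 2 so that the hop is at least 1.
    """
    hop = chunk_size // 2
    # Ceiling division written with integer arithmetic.
    return max(1, -(-(length - chunk_size) // hop) + 1)

def segment(x, chunk_size):
    """Split x of shape (length, features) into overlapping chunks.

    Returns an array of shape (n_chunks, chunk_size, features) and the
    number of zero-padded steps appended to the end of x.
    """
    hop = chunk_size // 2
    n = num_chunks(x.shape[0], chunk_size)
    pad = (n - 1) * hop + chunk_size - x.shape[0]
    x = np.pad(x, ((0, pad), (0, 0)))
    chunks = np.stack([x[i * hop : i * hop + chunk_size] for i in range(n)])
    return chunks, pad

def attention_pairs(length, chunk_size):
    """Query-key pairs scored by full vs. dual-path self-attention."""
    n = num_chunks(length, chunk_size)
    full = length ** 2
    intra = n * chunk_size ** 2   # attention inside each chunk (local)
    inter = chunk_size * n ** 2   # attention across chunks (global)
    return full, intra + inter

x = np.random.randn(4000, 8)      # hypothetical 4000-step feature sequence
chunks, pad = segment(x, chunk_size=100)
intra_view = chunks                       # (n_chunks, chunk_size, features)
inter_view = chunks.transpose(1, 0, 2)    # (chunk_size, n_chunks, features)
print(chunks.shape, pad, attention_pairs(4000, 100))
```

Under these assumed sizes, a 4000-step sequence needs about 1.4 million attention pairs with the two-pass scheme, against 16 million for full self-attention, which is one way to see why long sequences become tractable.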