Speech enhancement is widely used in communication equipment, hearing aids, and other products, and also serves as a front end for downstream speech technologies such as speech recognition, speech coding, and speech synthesis, so it has high research value. Neural-network-based speech enhancement is one of the important research directions in this area. Attention mechanisms can better capture the global correlations among features, further improving the capability of speech enhancement networks: methods such as channel attention and self-attention enable the network to distinguish speech from noise and thus suppress noise more effectively. However, current attention-based speech denoising methods still have limitations. To further improve single-channel (monaural) speech enhancement, this thesis carries out the following research:

(1) Traditional convolutional recurrent networks extract speech information poorly at low signal-to-noise ratios, and existing time-domain two-stage Transformer architectures are comparatively complex. To address this, this thesis applies the Discrete Cosine Transform (DCT) to convert time-domain waveforms into real-valued frequency-domain signals, avoiding the separate handling of magnitude and phase. A two-stage Transformer then learns local and global features of the frequency-domain representation, yielding a Transformer model in the DCT domain. The noisy speech passes through the encoder, the Transformer module, and the decoder in turn to produce an enhanced spectrum, and the enhanced waveform is finally recovered by the inverse short-time discrete cosine transform. Experiments show that, across a range of signal-to-noise ratios, this method achieves better noise reduction than the comparison methods in terms of short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ).

(2) Current attention-based speech enhancement algorithms do not fully exploit the spatial and temporal information contained in frequency-domain features. To address this, this thesis combines separable self-attention with a complex-valued neural network, adding the separable self-attention mechanism on top of an encoder-decoder network. Separable self-attention is divided into spatial self-attention and temporal self-attention: speech features first pass through the spatial self-attention module to learn channel and spatial information, and then through the temporal self-attention module, which focuses on the time dimension. By learning the features of each dimension of the spectrum in a targeted way, the network can effectively extract important features, suppress weakly correlated ones, and reduce parameter redundancy. Simulation experiments on the TIMIT and NoiseX-92 datasets show that this method achieves good speech enhancement performance.
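The DCT front end described in contribution (1) can be illustrated with a minimal short-time DCT analysis/synthesis sketch. This is not the thesis's implementation; the frame length, hop size, and windowing choices below are assumptions, and `stdct`/`istdct` are hypothetical helper names. The point is that the per-frame DCT yields a purely real spectrum, so an enhancement network operating on it never has to model phase separately.

```python
import numpy as np
from scipy.fft import dct, idct

def stdct(x, frame_len=512, hop=256):
    """Short-time DCT: window the signal into overlapping frames and
    apply a DCT-II to each frame. The result is a real-valued
    (n_frames, frame_len) spectrum -- no magnitude/phase split needed."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return dct(frames, type=2, norm="ortho", axis=-1)

def istdct(spec, hop=256):
    """Inverse short-time DCT with windowed overlap-add synthesis."""
    n_frames, frame_len = spec.shape
    window = np.hanning(frame_len)
    frames = idct(spec, type=2, norm="ortho", axis=-1) * window
    out = np.zeros((n_frames - 1) * hop + frame_len)
    wsum = np.zeros_like(out)
    for i, f in enumerate(frames):
        out[i * hop : i * hop + frame_len] += f
        wsum[i * hop : i * hop + frame_len] += window ** 2
    return out / np.maximum(wsum, 1e-8)  # normalize by summed window energy
```

With a Hann window and 50% overlap, analysis followed by synthesis reconstructs the interior of the signal exactly, so any modification made to the real-valued spectrum in between (e.g. by the enhancement network) maps directly back to a waveform.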
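The separable self-attention factorization in contribution (2) can be sketched as two passes of ordinary scaled dot-product attention over different axes of a spectral feature map. This is a simplified real-valued NumPy sketch, not the thesis's complex-valued model; the `(time, freq, channels)` layout and all parameter names are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(x, wq, wk, wv):
    """Scaled dot-product self-attention over a (positions, channels) slice."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v

def separable_self_attention(feat, params_spatial, params_temporal):
    """feat: (time, freq, channels) spectral features.
    Spatial self-attention attends across frequency bins within each frame;
    temporal self-attention then attends across frames within each bin."""
    T, F, C = feat.shape
    # Stage 1 (spatial): per time frame, mix information across frequency.
    spatial = np.stack([attend(feat[t], *params_spatial) for t in range(T)])
    # Stage 2 (temporal): per frequency bin, mix information across time.
    temporal = np.stack([attend(spatial[:, f], *params_temporal)
                         for f in range(F)], axis=1)
    return temporal
```

Factorizing one joint attention over all time-frequency positions into these two axis-wise passes is what reduces parameter and computation redundancy: each pass only forms attention maps of size `freq x freq` or `time x time` instead of `(time*freq) x (time*freq)`.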