The purpose of speech noise reduction is to suppress the environmental noise in noisy speech signals as much as possible and to improve the quality and intelligibility of the speech. It has been widely applied in hearing aids, intelligent interactive systems, and military communications. Over the past few decades, scholars have proposed many traditional noise reduction methods. However, these methods usually rely on strong assumptions about the signal, so their performance is poor at low SNR and under non-stationary noise, leaving more residual noise, speech distortion, and quality degradation. In recent years, with the rapid development of deep learning, noise reduction methods based on deep learning have been widely studied. These methods use the powerful learning ability of neural networks to model the complex nonlinear relationship between noisy speech and the corresponding clean speech directly, without any prior assumptions, and have been shown to achieve far better performance than traditional methods. This paper mainly studies three classical traditional noise reduction methods and two different deep-learning-based methods. The specific research contents are as follows:

(1) This paper introduces three classical traditional noise reduction methods: spectral subtraction, Wiener filtering, and the optimally modified log-spectral amplitude (OM-LSA) estimation method. Relevant simulation experiments are carried out, and the advantages and disadvantages of the three methods are analyzed and compared.

(2) This paper studies a noise reduction method based on feature mapping. Its basic steps, including data preparation, feature extraction, model building and training, and waveform reconstruction, are introduced in detail. Training and test data sets in both Chinese and English are constructed, and magnitude-spectrum features are extracted from them. Four neural network models are built: a deep neural network, a deep recurrent neural network, a U-Net, and a convolutional recurrent neural network, which are trained on the English and Chinese training sets respectively. Finally, the three traditional methods and the four models are evaluated on the English and Chinese test sets, and their advantages and disadvantages are analyzed and compared. The experimental results show that the feature-mapping-based method obtains better performance than the traditional noise reduction methods on both the English and Chinese test sets. Among the feature-mapping models, the convolutional recurrent neural network uses two-dimensional convolution layers to extract more accurate local features from the magnitude spectrum of the input speech, uses skip connections to concatenate features and generate multi-scale features for predicting the magnitude spectrum of the clean speech, and in addition uses a long short-term memory network to learn the temporal characteristics of the speech signal. It therefore achieves the best noise reduction performance, which fully demonstrates how much the advantages of a neural network architecture matter for the speech noise reduction task.

(3) This paper studies an end-to-end noise reduction method. The basic steps of end-to-end noise reduction and of the Wave-U-Net-based method are introduced. To address the shortcomings of the Wave-U-Net model, three improvements are proposed. First, in order to deepen the network and extract deeper abstract features while avoiding vanishing gradients, residual units are introduced to replace the ordinary one-dimensional convolution layers in the encoder and decoder of the original model. Second, in order to bridge the semantic gap that easily arises when the Wave-U-Net model concatenates features, a channel attention mechanism is introduced; at the same time, in order to enhance or suppress the noisy speech signal according to its importance at different time steps, a spatial attention mechanism is also introduced, yielding a model based on a mixed-domain attention mechanism. Finally, in order to enlarge the receptive field, an atrous spatial pyramid pooling module is introduced to replace the ordinary one-dimensional convolution layers in the middle of the original model, so that the model obtains multi-scale features from different receptive fields while enlarging the receptive field, further improving the noise reduction performance of the system. Experimental results show that all three improvements raise the quality and intelligibility of the denoised speech in both Chinese and English, making the model more suitable for speech noise reduction.
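To make the simplest of the three classical methods in (1) concrete, the following is a minimal magnitude spectral-subtraction sketch in NumPy. It is an illustration only, not the thesis implementation: the frame length, hop, spectral floor, and the assumption that the first few frames are noise-only are all illustrative choices.

```python
import numpy as np

def spectral_subtraction(noisy, frame_len=512, hop=256, noise_frames=6, floor=0.002):
    """Basic magnitude spectral subtraction with overlap-add resynthesis.

    Assumes the first `noise_frames` frames contain only noise; their mean
    magnitude spectrum serves as the noise estimate (an illustrative choice).
    """
    window = np.hanning(frame_len)
    n = 1 + max(0, (len(noisy) - frame_len) // hop)
    # Split into overlapping, windowed frames and go to the frequency domain
    frames = np.stack([noisy[i * hop:i * hop + frame_len] * window for i in range(n)])
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    # Estimate the noise magnitude from the leading noise-only frames
    noise_mag = mag[:noise_frames].mean(axis=0)
    # Subtract, with a small spectral floor to limit musical noise
    clean_mag = np.maximum(mag - noise_mag, floor * noise_mag)
    # Recombine with the noisy phase and overlap-add back to a waveform
    clean_frames = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    out = np.zeros((n - 1) * hop + frame_len)
    for i in range(n):
        out[i * hop:i * hop + frame_len] += clean_frames[i] * window
    return out
```

Because the window is applied at both analysis and synthesis, a production implementation would additionally divide by the overlap-added squared window to make resynthesis exact; the sketch omits that normalization for brevity.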
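The mixed-domain attention improvement in (3) combines a channel gate with a gate over time steps. The abstract does not give the exact layer equations, so the following NumPy sketch is only one plausible CBAM-style realization: the bottleneck MLP for channel attention, the mean/max pooling over channels, the 1-D convolution width, and all weight shapes are assumptions for illustration.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def channel_attention(x, w1, b1, w2, b2):
    """x: (C, T) feature map. Gates each channel by its global importance."""
    s = x.mean(axis=1)                       # squeeze: average-pool over time
    z = np.maximum(w1 @ s + b1, 0.0)         # bottleneck MLP with ReLU
    return x * sigmoid(w2 @ z + b2)[:, None] # per-channel gate in (0, 1)

def spatial_attention(x, w, b, k=7):
    """Gates time steps using channel-pooled statistics and a width-k 1-D conv."""
    pooled = np.stack([x.mean(axis=0), x.max(axis=0)])        # (2, T)
    p = np.pad(pooled, ((0, 0), (k // 2, k // 2)), mode="edge")
    conv = np.array([np.sum(w * p[:, t:t + k]) + b for t in range(x.shape[1])])
    return x * sigmoid(conv)[None, :]        # per-time-step gate in (0, 1)

def mixed_domain_attention(x, params):
    """Channel gate followed by spatial (time) gate, as in a mixed-domain block."""
    return spatial_attention(channel_attention(x, *params["ca"]), *params["sa"])
```

Both gates are sigmoids in (0, 1), so the block can only rescale features, enhancing or suppressing channels and time steps according to their learned importance; in the thesis this is applied to the features that Wave-U-Net concatenates across its skip connections.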