Speech enhancement refers to the process of suppressing noise in noisy speech; this article addresses the suppression of single-channel additive noise. Traditional speech enhancement methods generally suppress noise by constructing filters. In the real world, however, noise is often a non-stationary signal, that is, its power spectral density is not constant. When traditional methods perform noise estimation, it is difficult to estimate this power spectral density accurately, which leaves residual noise after suppression. Even though white noise is a stationary signal, traditional algorithms still make errors when estimating its power spectral density, which leads to the appearance of "music noise". This paper studies an end-to-end speech enhancement method based on Fully Convolutional Networks (FCN) and Long Short-Term Memory networks (LSTM). As supervised models trained on large data sets, these deep learning methods can estimate clean speech more accurately than traditional speech enhancement methods. Starting from FCN and LSTM baseline models, this paper proposes the following optimization schemes according to the nature and characteristics of the speech enhancement task:

1. An end-to-end FCN speech enhancement model based on an optimized data-adaptive neuron activation and a Huber loss function with an adaptive threshold. The data-adaptive activation can automatically adjust the activation step point through a control function, but the expression of that control function has limitations, so this article makes two optimizations: (1) the control function is computed along the sequence dimension instead of the batch dimension, and (2) a data-distribution offset term is added when computing the control function. In the time-domain signal of noisy speech, when the value of the noise sample at a certain moment is much greater than the value of the speech sample, that sample is called an abnormal sample point. To prevent abnormal sample points from dominating model learning, this paper adopts the Huber loss function, sets its threshold as a trainable parameter, and updates the threshold automatically during training, so that the model adjusts its own loss function and avoids overfitting to abnormal sample points when they occur.

2. An end-to-end LSTM speech enhancement model based on layer normalization and the Huber loss function with an adaptive threshold. The LSTM is a variant of the recurrent neural network. To further reduce the risk of vanishing or exploding gradients in the LSTM, a sequence normalization module is added after the memory-cell update at each time step. Its purpose is to pull the time-series data from a scattered distribution toward a Gaussian distribution, keeping the data away from the gradient-saturation region of the activation function. As in point 1, the Huber loss with an adaptive threshold is used to prevent abnormal sample points from affecting learning.

3. A speech enhancement model combining the optimized end-to-end LSTM with Self-Attention. The hidden vectors of the optimized LSTM model from point 2 learn a great deal of information during training, but a feedforward neural network does not fully exploit this information. This paper therefore uses the optimized LSTM model as an encoder that learns the hidden-state encoding of the speech sequence, while a Self-Attention model acts as the decoder: it uses the hidden vectors of all sample points in the sequence to strengthen the representation of the sample point the model is currently attending to, and learns an attention score for every sample point in the hidden-state sequence.

4. Experiments show that the fitting ability of the proposed optimized models on new types of noise still needs improvement. This paper therefore proposes a "music noise" suppression model based on feature fusion: traditional speech enhancement methods are used as weak learners whose output contains "music noise"; this output is added to the original data set as an extra feature to form a new data set, on which the neural network model is then trained. The aim is for the neural network to learn the characteristics of "music noise" so that it can suppress them.

Finally, the effectiveness of the optimization schemes above is verified through experiments, using the PESQ and STOI evaluation functions. The results show that the optimization schemes improve speech enhancement over the FCN and LSTM baseline models to a certain extent, and that their ability to suppress "music noise" and other non-stationary noise is significantly better than that of traditional speech enhancement methods.
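The adaptive-threshold Huber loss used in points 1 and 2 can be illustrated with a minimal pure-Python sketch. This is not the thesis implementation: it only shows how the loss grows linearly beyond the threshold `delta`, so an abnormal sample point contributes far less than it would under a squared error; in the thesis, `delta` itself is a trainable parameter updated during training.

```python
def huber_loss(residuals, delta):
    """Mean Huber loss: quadratic inside the threshold, linear outside.

    `residuals` are per-sample errors (enhanced minus clean speech);
    `delta` is the threshold, which the thesis makes trainable.
    """
    total = 0.0
    for r in residuals:
        a = abs(r)
        if a <= delta:
            total += 0.5 * a * a                  # quadratic region: behaves like MSE
        else:
            total += delta * (a - 0.5 * delta)    # linear region: damps outliers
    return total / len(residuals)
```

For a residual of 3.0 with delta = 1.0 the Huber loss is 2.5, versus 4.5 for the squared-error term 0.5·r², so abnormal sample points pull the gradient far less.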
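The sequence normalization step in point 2 follows the idea of layer normalization. A minimal sketch (pure Python, with an assumed epsilon for numerical stability; not the thesis code) of normalizing one hidden-state vector to zero mean and unit variance is:

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a hidden-state vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]
```

In the optimized LSTM this kind of normalization would be applied after the memory-cell update at each time step, keeping the values away from the saturation regions of the tanh and sigmoid activations.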
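The Self-Attention decoder in point 3 assigns each sample point in the hidden-state sequence a score relative to the point currently being attended to. A scaled dot-product formulation (a common Self-Attention variant, assumed here for illustration rather than taken from the thesis) can be sketched as:

```python
import math

def attention_scores(query, keys):
    """Scaled dot-product attention scores of one query over a sequence of keys."""
    d = len(query)
    logits = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]         # softmax: scores sum to 1
```

The resulting scores weight the hidden vectors of all sample points when forming the enhanced representation of the current sample point.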