Speech separation is the front end of a speech processing system, and its quality directly affects the performance of the subsequent speech signal processing stages. The separation performance and generalization of traditional speech separation algorithms need to be improved in environments with a low signal-to-noise ratio and high reverberation. In this paper, based on the perceptual characteristics of the human auditory system and combined with spatial features, two deep-learning-based binaural speech separation algorithms are studied: a binaural speech separation algorithm based on soft masking spatial features and a Convolutional Neural Network (CNN), and a binaural speech separation algorithm based on a Gated Convolutional Recurrent Network (GCRN).

(1) Binaural speech separation algorithm based on soft masking spatial features and a Convolutional Neural Network. Building on Computational Auditory Scene Analysis (CASA), this algorithm uses a Gammatone filter bank to simulate the time-frequency analysis performed by the human auditory system and extracts binaural spatial features from each time-frequency unit, including the Cross-Correlation Function (CCF), Interaural Time Difference (ITD), and Interaural Level Difference (ILD). A soft masking layer detects the high-energy regions of the speech signal and weights the spatial feature parameters to reflect the probability that speech dominates a unit under noise and reverberation; ITD and ILD are thereby transformed into smITD and smILD. The binaural speech separation algorithm based on a Deep Neural Network (DNN) operates on a single frame of speech and therefore ignores the temporal structure of speech. To address this, a CNN takes the spatial features of consecutive frames as input, modeling their temporal dependence, and uses the Ideal Ratio Mask (IRM) of the corresponding azimuth as the training target, so that the mixed speech is separated on the basis of spatial information. Separation quality is evaluated with the Sources to Artifacts Ratio (SAR), Source to Distortion Ratio (SDR), Source to Interferences Ratio (SIR), and Perceptual Evaluation of Speech Quality (PESQ). Simulation results show that this algorithm outperforms both the DNN-based separation algorithm and a CNN-based separation algorithm without soft masking features.
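As an illustration only (not the implementation studied in this paper), the sketch below shows how CCF, ITD, and ILD can be computed for one time-frequency unit and weighted into smITD/smILD. It is a minimal NumPy example: the CCF is approximated by a circular normalized cross-correlation, the energy-based soft weight is a hypothetical stand-in for the soft masking layer, and the parameter values (sampling rate, maximum lag, noise floor) are assumptions.

```python
import numpy as np

def ccf_itd_ild(left_unit, right_unit, fs=16000, max_lag=16):
    """CCF, ITD (seconds) and ILD (dB) for one time-frequency unit
    (one Gammatone subband, one frame). Circular correlation is a
    simplification used here for brevity."""
    lags = np.arange(-max_lag, max_lag + 1)
    ccf = np.array([np.sum(left_unit * np.roll(right_unit, lag)) for lag in lags])
    norm = np.sqrt(np.sum(left_unit**2) * np.sum(right_unit**2)) + 1e-12
    ccf = ccf / norm                            # normalized cross-correlation
    itd = lags[np.argmax(ccf)] / fs             # lag of the CCF peak, in seconds
    ild = 10.0 * np.log10((np.sum(left_unit**2) + 1e-12)
                          / (np.sum(right_unit**2) + 1e-12))
    return ccf, itd, ild

def soft_mask_weight(left_unit, right_unit, noise_floor=1e-4):
    """Hypothetical energy-based weight in [0, 1]; larger values indicate
    units more likely to be dominated by speech."""
    energy = 0.5 * (np.mean(left_unit**2) + np.mean(right_unit**2))
    return energy / (energy + noise_floor)

# Example with random data standing in for one subband frame of each ear
rng = np.random.default_rng(0)
l, r = rng.standard_normal(320), rng.standard_normal(320)
ccf, itd, ild = ccf_itd_ild(l, r)
w = soft_mask_weight(l, r)
sm_itd, sm_ild = w * itd, w * ild               # soft-masked spatial features
```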
(2) Binaural speech separation algorithm based on a Gated Convolutional Recurrent Network. Because of its limited receptive field, a CNN cannot capture a wider range of context without increasing the number of parameters; at the same time, a network structure better suited to modeling time-series signals such as speech is needed. In this paper, the CNN is improved and combined with the Gated Recurrent Unit (GRU), and the resulting GCRN is proposed for binaural speech separation. The improved CNN adopts a gating mechanism and a residual structure, which ease network training and accelerate convergence; its convolutional layers use dilated convolution to enlarge the receptive field while keeping the kernel size small. The GRU is a kind of Recurrent Neural Network (RNN) that alleviates the vanishing- and exploding-gradient problems in long-sequence training, and its simple internal structure allows faster training. The GCRN thus combines the high-dimensional feature extraction capability of the CNN with the time-series modeling capability of the RNN. Simulation results show that the GCRN-based binaural speech separation algorithm improves on all evaluation indicators compared with the CNN-based and GRU-based separation algorithms, achieving better separation quality and generalization.
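A minimal sketch of such a gated convolutional recurrent structure is shown below, assuming a PyTorch implementation. The number of blocks, dilation schedule, feature and hidden dimensions, and the sigmoid mask head are illustrative assumptions, not the configuration used in this paper.

```python
import torch
import torch.nn as nn

class GatedDilatedConvBlock(nn.Module):
    """One gated, dilated 1-D convolution block with a residual connection."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=pad, dilation=dilation)
        self.gate = nn.Conv1d(channels, channels, kernel_size,
                              padding=pad, dilation=dilation)

    def forward(self, x):                        # x: (batch, channels, frames)
        gated = torch.tanh(self.conv(x)) * torch.sigmoid(self.gate(x))
        return x + gated                         # residual connection

class GCRNSketch(nn.Module):
    """Stacked gated dilated conv blocks followed by a GRU and a mask head."""
    def __init__(self, feat_dim=64, hidden=128, n_blocks=4):
        super().__init__()
        self.blocks = nn.Sequential(*[
            GatedDilatedConvBlock(feat_dim, dilation=2 ** i)
            for i in range(n_blocks)             # dilations 1, 2, 4, 8
        ])
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, feat_dim)  # per-channel ratio mask

    def forward(self, feats):                    # feats: (batch, frames, feat_dim)
        x = self.blocks(feats.transpose(1, 2)).transpose(1, 2)
        x, _ = self.gru(x)
        return torch.sigmoid(self.head(x))       # IRM-style output in [0, 1]

# Example: 8 utterances, 100 frames, 64-dimensional spatial features per frame
mask = GCRNSketch()(torch.randn(8, 100, 64))
print(mask.shape)                                # torch.Size([8, 100, 64])
```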