In daily life, the target speech is often seriously corrupted by interference sources and various kinds of noise, so that indicators of the target speech such as PESQ, STOI and SNR drop sharply. This seriously degrades the accuracy of back-end speech recognition and gives listeners a poor hearing experience. Speech separation is an important part of speech signal processing. Its task is to separate the speech of interest from overlapping audio and to remove interference and noise as much as possible, thereby improving the STOI, SNR, PESQ and other indicators of the target speech.

Speech signals are temporal signals: recurrent neural networks can effectively model their temporal features, while convolutional neural networks can effectively extract the structural features of the spectrum. This paper therefore proposes a convolutional gated recurrent neural network for causal speech separation that combines the two. In addition, because the fixed size of the receptive field limits the performance of convolutional neural networks in speech separation, a target speech separation method based on multi-scale feature fusion is proposed. The main contributions are as follows:

(1) A convolutional gated recurrent neural network for single-channel causal speech-noise separation is proposed to address the performance drop that occurs when the input to the separation model is causal. The network combines the advantages of recurrent and convolutional neural networks in speech separation: by substituting convolution operations for the fully connected matrix products in the recurrent neural network, it effectively retains the spectral structure of speech, improves the PESQ, SSNR and STOI of the separated speech, and reduces the number of model parameters. In addition, the output of a network unit at the current time is determined by the input at the current time together with the input and output at the previous time, which makes full use of the feature information of the causal input. This design improves the performance of single-channel speech-noise separation under causal input.

(2) Because the fixed-size receptive field limits the performance of convolutional neural networks in speech separation, a multi-channel target speech separation model based on multi-scale feature fusion is proposed. The model uses group convolution and dilated convolution to extract multi-scale speech features and directional features while reducing the number of model parameters, and it greatly improves the performance of convolutional neural networks in target speech separation. Furthermore, to model the temporal characteristics of speech and improve the quality of the target speech, a temporal convolutional network (TCN) is used to strengthen the temporal modeling ability of the model. The separated target speech shows large improvements in PESQ, STOI, SI-SDR and other indicators.

Experiments on open datasets, and on datasets generated from them, show that the proposed methods substantially improve STOI, PESQ, SSNR and other indicators compared with traditional network structures, while also reducing the number of parameters of the speech separation model.
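The idea of replacing the fully connected matrix products of a gated recurrent unit with convolutions can be sketched as follows. This is a minimal illustrative PyTorch implementation of a generic convolutional GRU cell, not the authors' exact architecture; the class name, kernel size and channel counts are assumptions for the example.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """GRU cell whose gate transforms are 2-D convolutions, so the
    time-frequency structure of the spectrogram is preserved instead
    of being flattened into a vector (illustrative sketch)."""

    def __init__(self, in_ch, hid_ch, kernel=3):
        super().__init__()
        pad = kernel // 2
        # One convolution jointly produces the update and reset gates.
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, kernel, padding=pad)
        # Convolution for the candidate hidden state.
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, kernel, padding=pad)

    def forward(self, x, h):
        zr = torch.sigmoid(self.gates(torch.cat([x, h], dim=1)))
        z, r = zr.chunk(2, dim=1)                  # update gate, reset gate
        n = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * n + z * h                 # new hidden state

# One recurrent step on a batch of spectrogram frames.
cell = ConvGRUCell(in_ch=1, hid_ch=16)
x = torch.randn(2, 1, 64, 64)                      # (batch, ch, freq, time)
h = torch.zeros(2, 16, 64, 64)                     # previous hidden state
h = cell(x, h)
print(h.shape)  # torch.Size([2, 16, 64, 64])
```

Because each gate is a convolution rather than a dense matrix, the parameter count depends only on the kernel size and channel counts, not on the spectrogram size, which is one source of the parameter reduction described above.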
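The multi-scale feature extraction with group and dilated convolutions described in contribution (2) can be sketched like this. Again, this is a hypothetical PyTorch example, assuming parallel dilated branches whose outputs are fused by a 1x1 convolution; the dilation rates, group count and channel width are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """Parallel grouped, dilated 1-D convolutions over the feature
    sequence: grouping reduces the parameter count, while different
    dilation rates give each branch a different receptive field.
    The branch outputs are fused by a pointwise convolution."""

    def __init__(self, ch, dilations=(1, 2, 4, 8), groups=4):
        super().__init__()
        self.branches = nn.ModuleList(
            # padding = dilation keeps the sequence length unchanged
            # for a kernel of size 3.
            nn.Conv1d(ch, ch, 3, padding=d, dilation=d, groups=groups)
            for d in dilations
        )
        self.fuse = nn.Conv1d(ch * len(dilations), ch, 1)  # 1x1 fusion

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]
        return self.fuse(torch.cat(feats, dim=1))

block = MultiScaleBlock(ch=32)
y = block(torch.randn(2, 32, 100))                 # (batch, ch, frames)
print(y.shape)  # torch.Size([2, 32, 100])
```

Stacking such blocks with growing dilation rates is also the building pattern of a TCN, which is how the temporal modeling capacity mentioned above is usually obtained without recurrence.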