
Research On Monaural Speech Separation Of Specific Speaker Based On Deep Learning

Posted on: 2021-03-30  Degree: Master  Type: Thesis
Country: China  Candidate: Y X Zhang  Full Text: PDF
GTID: 2428330629452986  Subject: Electronic Science and Technology
Abstract/Summary:
The purpose of speech separation is to separate the target speech signal of interest from a mixed speech signal. It has research significance and application value in speech recognition, smart homes, and information retrieval for criminal investigation. Traditional monaural speech separation techniques often need to assume independence between the source signals, ignore the temporal correlation of speech signals, and are limited in model structure and scale, so their separation performance is not ideal. In recent years, deep learning has made major breakthroughs in image segmentation, speech recognition, text classification and other fields, which offers a new approach to speech separation. To address the problems of traditional speech separation techniques, this thesis studies the monaural speech separation task with deep learning as follows:

(1) Because speech signals are temporally correlated and the Recurrent Neural Network (RNN) has a natural advantage in modeling time series, this thesis designs an RNN-based separation model that separates a specific speaker in the spectral domain. For specific-speaker separation, the data set used to train the network is constructed from non-specific-speaker speech and specific-speaker speech without overlap. In addition, to address the long-term dependency problem of the standard RNN, separation models based on Long Short-Term Memory (LSTM) and Bi-directional Long Short-Term Memory (BLSTM) were also constructed. The three network models use the same network parameters, and the experimental results show that the BLSTM model has better separation performance and generalization ability than the RNN and LSTM models. Finally, the relevant parameters of the BLSTM model were optimized to achieve its best separation result, and the overall separation performance index, the signal-to-distortion ratio (SDR), reached 8.82 dB.

(2) The BLSTM model above still performs separation in the spectral domain, so the phase of the mixed speech is used to estimate the target speaker's speech when the time-domain signal is reconstructed, which inevitably degrades the separation. Therefore, an improved time-domain speech separation model based on the U-Net network is designed. The most important features of U-Net are its encoder-decoder structure and its skip-connection fusion layers, which extract and fuse multi-scale features of the input speech signal in the time domain. Since the time-domain speech waveform is a one-dimensional sequence, the convolutions of the original U-Net are changed to one-dimensional convolutions to facilitate feature extraction from the waveform. To make full use of context information and avoid losing information at the endpoints, edge padding is applied to the input data before convolution. In addition, the network depth is increased to obtain a larger receptive field and extract deeper features. On the basis of the improved network, the input parameters and the number of network layers were further adjusted to obtain a better separation model, and the SDR finally reached 10.27 dB.

Finally, the overall separation performance and generalization performance of the improved U-Net time-domain model and the BLSTM spectral-domain model are compared on same-sex speakers, opposite-sex speakers and untrained speakers. The experimental results show that, compared with the BLSTM separation model, the improved U-Net time-domain model improves the SDR by 1.45 dB in the opposite-sex speaker test and by 1.69 dB in the same-sex speaker test. In the test on untrained speakers, its indicators were also mostly higher than those of the BLSTM model. These results show that the improved U-Net time-domain separation network clearly improves overall separation performance and generalization ability, which demonstrates the effectiveness of the improved U-Net time-domain separation method.
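The SDR figures quoted in the abstract can be read against the following minimal sketch. The abstract does not state which SDR definition was used, so this assumes the simplest energy-ratio form, 10·log10(‖s‖² / ‖s − ŝ‖²); evaluation toolkits often use a more elaborate decomposition, and the function name `sdr_db` is hypothetical.

```python
import math


def sdr_db(reference, estimate):
    """Energy-ratio SDR in dB (assumed form, not necessarily the thesis's).

    reference: clean target samples s
    estimate:  separated samples s_hat, same length as reference
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(s * s for s in reference)
    if signal_energy == 0.0:
        raise ValueError("reference signal has zero energy")
    # Everything that differs from the reference is counted as distortion.
    error_energy = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    if error_energy == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / error_energy)


if __name__ == "__main__":
    target = [1.0, -1.0, 1.0, -1.0]
    separated = [0.9, -0.9, 0.9, -0.9]  # 10% amplitude error on every sample
    print(round(sdr_db(target, separated), 2))
```

Under this definition a gain of about 1.5 dB means the residual error energy drops by a factor of roughly 1.4 for the same target energy.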
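The edge-padding step described in part (2) can be illustrated as follows. This is a sketch under an assumption: the abstract does not define "edge padding", so it is taken here to mean repeating the first and last samples before a one-dimensional convolution so that the output keeps the input length. The helper names are hypothetical, and a real model would use a deep learning framework's convolution layers rather than plain lists.

```python
def edge_pad(signal, pad):
    """Repeat the first and last samples `pad` times on each side."""
    if pad < 0:
        raise ValueError("pad must be non-negative")
    if not signal:
        raise ValueError("signal must not be empty")
    return [signal[0]] * pad + list(signal) + [signal[-1]] * pad


def conv1d_valid(signal, kernel):
    """'Valid' 1-D cross-correlation: no implicit padding, output is shorter."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def conv1d_same_edge(signal, kernel):
    """Edge-pad by half the kernel width so output length equals input length."""
    k = len(kernel)
    if k % 2 == 0:
        raise ValueError("this sketch assumes an odd kernel length")
    return conv1d_valid(edge_pad(signal, k // 2), kernel)


if __name__ == "__main__":
    waveform = [1.0, 2.0, 3.0, 4.0]
    kernel = [1.0, 1.0, 1.0]
    print(conv1d_same_edge(waveform, kernel))
```

Compared with zero padding, the endpoint outputs here are computed from repeated real samples instead of artificial zeros, which is one plausible reading of how the thesis avoids losing endpoint information.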
Keywords/Search Tags:Mono Channel, Speech Separation, Specific Speaker, BLSTM, U-Net, Time Domain Speech Signal