Research On Supervised Speech Separation Based On Deep Learning

Posted on: 2019-11-05
Degree: Master
Type: Thesis
Country: China
Candidate: T Zhang
Full Text: PDF
GTID: 2428330542496720
Subject: Control engineering
Abstract/Summary:
Speech separation has long been an important research direction in speech signal processing. With the spread of smart devices in recent years, the quality of the front-end speech separation module directly affects the development of speech interaction technology. However, because of the complexity of real acoustic environments and the limited ability of early shallow models to capture the nonlinear structure of speech, separation performance has remained unsatisfactory under single-channel, low signal-to-noise ratio (SNR), non-stationary noise conditions. Deep models, with their inherent multilayer nonlinear structure, can learn deep abstract features automatically and are well suited to mining the structural information hidden in speech data. Applying deep learning to speech separation is therefore of great significance.

This thesis addresses supervised speech separation under single-channel, non-stationary noise conditions. A deep model is used to learn the nonlinear mapping between the original noisy speech and the clean target speech. The main work of the thesis is as follows.

First, the deep neural network (DNN) based speech separation method is studied and implemented. To address the shortcomings of the existing DNN model, we propose an improved model called C-DNN, obtained by adding a one-dimensional convolutional layer (together with a pooling layer) in front of the general DNN structure. The convolutional front end of C-DNN models the frame-level features directly, so that abstract features are learned automatically from the speech data through the one-dimensional convolution preprocessing. The subsequent fully connected layers of C-DNN then learn the nonlinear mapping between the features of the noisy speech and the ideal target. The proposed model makes full use of the correlation between adjacent frequency bands within each time frame, and reduces both the difficulty of feature extraction and the dimension of the input features. We test the separation performance of C-DNN on single-frame speech features and compare it with a DNN of the same depth. The input features are 64-channel gammatone features (GF) and their first-order deltas, and the training target is the ideal binary mask (IBM). Experimental results show that C-DNN, with fewer parameters, separates speech significantly better than the DNN under the same experimental conditions, yielding target speech with better intelligibility and perceptual quality. Several further experiments with different SNRs and background noises verify the generalization ability of the proposed method.

Furthermore, we propose a convolutional neural network (CNN) based speech separation method, which exploits the CNN's inherent strength in processing two-dimensional signals and its powerful feature learning ability to extract spatio-temporal structure from the time-frequency representation of speech. By modeling the contextual features of the speech spectrum, the method makes full use of the time-frequency correlation and local features of the speech signal, which is conducive to improving separation performance. We verify the CNN-based method on speech features with a context window and compare it with the DNN model. The context window size is set to five time frames. Experiments are conducted with two training targets, the ideal binary mask (IBM) and the ideal ratio mask (IRM). Separation performance is evaluated in terms of objective speech intelligibility, perceptual quality, and visualization of the separation targets. Both groups of experiments show that CNN-based systems outperform DNN-based ones in separation performance and generalization ability under the same experimental conditions.

Finally, the thesis briefly summarizes the main work and points out directions for future research.
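
The two training targets and the five-frame context window named in the abstract can be illustrated with a minimal NumPy sketch. This is not the thesis implementation. It uses the commonly cited definitions of the IBM (local SNR compared against a threshold) and the IRM (a compressed speech-to-mixture energy ratio). The function names, the 0 dB local criterion `lc_db`, the exponent `beta=0.5`, the symmetric window with edge padding, and the assumption that per-unit speech and noise energies are available are all illustrative choices that the abstract does not specify.

```python
import numpy as np

EPS = np.finfo(float).eps  # guards against division by zero and log(0)


def ideal_binary_mask(speech_power, noise_power, lc_db=0.0):
    """IBM: 1 where the local SNR exceeds lc_db (assumed threshold), else 0."""
    local_snr_db = 10.0 * np.log10((speech_power + EPS) / (noise_power + EPS))
    return (local_snr_db > lc_db).astype(np.float32)


def ideal_ratio_mask(speech_power, noise_power, beta=0.5):
    """IRM: (S / (S + N)) ** beta, with beta = 0.5 as an assumed exponent."""
    return (speech_power / (speech_power + noise_power + EPS)) ** beta


def stack_context(features, half_width=2):
    """Stack each frame with its neighbours.

    features: array of shape (frames, channels).
    Returns an array of shape (frames, 2 * half_width + 1, channels).
    half_width=2 gives a five-frame window; edges are padded by repetition
    (an assumption, since the abstract does not describe edge handling).
    """
    padded = np.pad(features, ((half_width, half_width), (0, 0)), mode="edge")
    n_frames = features.shape[0]
    window = 2 * half_width + 1
    return np.stack([padded[i:i + n_frames] for i in range(window)], axis=1)


# Toy example: 2 frames x 2 channels of speech and noise energy.
speech = np.array([[4.0, 1.0], [0.0, 9.0]])
noise = np.array([[1.0, 4.0], [1.0, 0.0]])
ibm = ideal_binary_mask(speech, noise)
irm = ideal_ratio_mask(speech, noise)

# Toy feature matrix: 3 frames x 2 channels, stacked into 5-frame windows.
feats = np.arange(6, dtype=float).reshape(3, 2)
ctx = stack_context(feats)
```

With 64-channel GF features plus first-order deltas, `features` would have 128 columns per frame, so each five-frame window would form a 5 x 128 time-frequency patch of the kind a two-dimensional CNN can take as input.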
Keywords/Search Tags: supervised speech separation, deep neural network, convolutional neural network, generalization ability, objective intelligibility, perceptual quality