
Research On Deep Learning Based Speech Enhancement

Posted on: 2018-09-06    Degree: Master    Type: Thesis
Country: China    Candidate: L J Li    Full Text: PDF
GTID: 2348330563451274    Subject: Information and Communication Engineering
Abstract/Summary:
Speech enhancement aims to recover clean speech from noisy mixtures and has been applied in many domains, such as hands-free vehicle equipment, mobile communication, teleconferencing and hearing aids. As the front end of automatic speech recognition (ASR), it plays a significant role in improving recognition performance in noisy environments and in overcoming the mismatch between training and test conditions, which in turn reduces the word error rate (WER). Deep learning is gradually replacing traditional algorithms as the dominant approach to speech enhancement because of its strong modeling capability and its ability to exploit the temporal structure and temporal correlations of speech signals. Nevertheless, speech enhancement at low signal-to-noise ratios (SNRs) and in non-stationary noise remains the core challenge. To address these problems, this thesis makes the following contributions to the design and selection of features and to the construction and optimization of models.

To address the lack of robustness of existing features in non-stationary noise and at low SNRs, the thesis works along two lines: improving the Multi-Resolution Cochleagram (MRCG) feature, currently among the most effective features for speech enhancement, and selecting complementary features. First, because the original MRCG smooths the high-resolution cochleagram with a mean filter whose noise suppression is not ideal, the thesis substitutes a median filter, an adaptive median filter and an Alpha mean filter for the original mean filter when computing the low-resolution cochleagrams, thereby improving the robustness of MRCG; the optimal window lengths are determined experimentally. Second, the thesis applies the Group Lasso algorithm to select complementary features from eight existing promising features, which are then concatenated as the input of a Deep Neural Network (DNN). Experiments show that the improved MRCG based on the Alpha mean filter achieves the best performance, and that the most complementary feature pair selected by Group Lasso, MRCG (based on the Alpha mean filter) and the Gammatone Feature (GF), improves the enhancement system in terms of segmental signal-to-noise ratio (SegSNR), speech quality and intelligibility.
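As an illustration of the smoothing step described above, the following is a minimal Python sketch of computing a low-resolution cochleagram with trimmed-mean smoothing. It assumes the thesis's "Alpha mean filter" is the standard alpha-trimmed mean filter; the window size and trimming fraction are placeholders, since the thesis tunes the window lengths experimentally, and the input cochleagram here is random stand-in data.

```python
import numpy as np
from scipy.ndimage import generic_filter

def alpha_trimmed_mean(window, alpha=0.2):
    """Sort the window values, drop the alpha/2 fraction of smallest and
    largest samples, and average the rest (the alpha-trimmed mean)."""
    v = np.sort(window)
    d = int(round(alpha * v.size / 2))
    trimmed = v[d:v.size - d] if v.size > 2 * d else v
    return trimmed.mean()

def low_resolution_cochleagram(cg, win=(11, 11), alpha=0.2):
    """Smooth a (channels x frames) cochleagram with a 2-D alpha-trimmed
    mean filter; this replaces the plain mean-filter smoothing used to
    build the low-resolution cochleagrams in the original MRCG."""
    return generic_filter(cg, alpha_trimmed_mean, size=win,
                          mode="nearest", extra_keywords={"alpha": alpha})

# Example: a 64-channel cochleagram of 200 frames (random stand-in data).
cg_high = np.abs(np.random.randn(64, 200))
cg_low = low_resolution_cochleagram(cg_high, win=(11, 11), alpha=0.2)
```

In the full MRCG feature, such smoothed cochleagrams are stacked with the high-resolution cochleagram before being fed to the DNN; the trimming makes the smoothing less sensitive to impulsive, non-stationary noise than a plain mean filter.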
For DNN optimization and training, the thesis applies two methods: Restricted Boltzmann Machine (RBM) pre-training, and dropout combined with replacing the sigmoid activation by Rectified Linear Units (ReLU). RBM pre-training effectively learns the statistical distribution of the training data and improves overall system performance, especially when the training set is small. Dropout helps avoid overfitting, and the ReLU activation maximizes the benefit of dropout while reducing the training time of the DNN. Experiments indicate that RBM pre-training improves all three evaluation metrics, particularly with small training sets and at low SNRs, while applying dropout and ReLU significantly reduces the residual noise in the enhanced speech.

To address the difficulty of accurately estimating the training target at low SNRs and in non-stationary noise, the thesis proposes a novel system structure in which a DNN and a Convolutional Neural Network (CNN) are combined to estimate the training target. First, the DNN estimates the mask matrix, exploiting its strong autonomous learning ability and its capacity to capture the correlations between frequency bands and the temporal and spatial structure of speech signals. The estimated mask matrix is then converted into a gray-scale map, and a CNN recognizes the mask values from this map, reducing the interference caused by speech frequency shifts and noise contamination. Experiments show that the CNN improves the accuracy of the final training-target estimate and the performance of the whole system in both stationary and non-stationary noise, especially in factory noise.
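The following is a minimal PyTorch sketch of the two-stage structure described above: a feed-forward DNN with ReLU activations and dropout estimates a ratio mask per frame, and a small CNN then refines the mask from the stacked gray-scale map. All layer widths, kernel sizes, the dropout rate and the feature dimension are illustrative assumptions rather than the thesis's configuration, and RBM pre-training is omitted.

```python
import torch
import torch.nn as nn

class MaskDNN(nn.Module):
    """Stage 1: feed-forward DNN mapping the acoustic features of one frame
    (e.g. MRCG + GF) to a ratio-mask estimate over 64 gammatone channels.
    ReLU replaces sigmoid in the hidden layers; dropout follows each layer."""
    def __init__(self, feat_dim, n_channels=64, hidden=1024, p_drop=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, n_channels), nn.Sigmoid(),  # mask values in [0, 1]
        )

    def forward(self, feats):                # (batch, feat_dim)
        return self.net(feats)               # (batch, n_channels)

class MaskCNN(nn.Module):
    """Stage 2: small CNN that reads the DNN's mask matrix as a one-channel
    gray-scale image (frequency x time) and outputs a refined mask of the
    same size, smoothing errors caused by noise and frequency shifts."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, kernel_size=3, padding=1), nn.Sigmoid(),
        )

    def forward(self, mask_img):             # (batch, 1, n_channels, frames)
        return self.net(mask_img)

# Usage sketch: per-frame DNN estimates are stacked over time and treated as
# a normalized gray-scale map, which the CNN refines into the final mask.
dnn, cnn = MaskDNN(feat_dim=512), MaskCNN()
frames = torch.randn(200, 512)                          # stand-in features
coarse_mask = dnn(frames)                               # (200, 64)
gray_map = coarse_mask.t().unsqueeze(0).unsqueeze(0)    # (1, 1, 64, 200)
refined_mask = cnn(gray_map)                            # (1, 1, 64, 200)
```

The sigmoid output layers keep both mask estimates in [0, 1], which matches the ideal ratio mask target named in the keywords; whether the gray-scale map is rescaled to pixel values before the CNN or fed directly as a normalized image is an implementation detail left open here.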
Keywords/Search Tags: speech enhancement, deep neural network, convolutional neural network, complementary features, ideal ratio mask