Research On Deep Learning Based Speech Enhancement

Posted on: 2019-07-03  Degree: Master  Type: Thesis
Country: China  Candidate: S Y Xu  Full Text: PDF
GTID: 2428330566970948  Subject: Military Intelligence
Abstract/Summary:
Speech enhancement is one of the most important areas of speech signal processing. As the front end of automatic speech recognition (ASR), it plays a significant role in improving recognition performance in noisy environments and in overcoming mismatched conditions. Following the success of deep neural networks (DNNs) in speech recognition, researchers have been inspired to apply deep models to the speech enhancement task. Because the deep non-linear structure of a DNN has strong modeling capability, such models can suppress some non-stationary noise. Improving robustness and performance under non-stationary noise and mismatched conditions nevertheless remains the core challenge. To address these problems, this thesis makes three contributions in network structure design and adaptation methods.

To address the lack of robustness in non-stationary noise and low-SNR environments, this thesis proposes a structure-preserving speech enhancement method in which a DNN estimates sub-space weights. First, exploiting the structure present in the ideal ratio mask (IRM), a compositional model is proposed that uses non-negative matrix factorization (NMF) to decompose the IRM into a set of spectro-temporal bases and associated weights. Then, instead of estimating the IRM directly, the DNN is trained to estimate the weights as a new target; the estimated weights linearly combine the mask bases to generate the estimated mask. The experimental results show that estimating the weights helps preserve the mask structure and that, because spectro-temporal information is taken into account, the structure-preserving system improves segmental signal-to-noise ratio (segSNR), speech quality (PESQ), and intelligibility (STOI). In addition, the training target matrix is sparser, which makes training faster than before.

To alleviate the mismatch problem that commonly arises in real-world use, this thesis studies two approaches. The first supplies noise identity vectors (i-vectors) as input features to the network in parallel with the regular acoustic features; the second fine-tunes the DNN with a transfer learning (TL) adaptation algorithm. I-vectors, which have been used successfully to represent speaker information in speech recognition, are applied here to represent the noise environment by tuning the training set. The TL method obtains a new model from an already trained model through adaptation training and a regularization algorithm. The experimental results show that the i-vector method is effective when the noise type is mismatched but ineffective when the SNR is mismatched, while the TL method is effective in both conditions. When the noise type is mismatched, combining the two methods further improves performance, especially at low SNRs and in non-stationary environments.

To address the unreasonable assumptions made by mask-estimating DNN-based speech enhancement systems, this thesis studies a novel end-to-end speech enhancement structure based on a generative adversarial network (SEGAN) and explores two activation functions. A GAN is composed of a generator G and a discriminator D. G captures the data distribution, and D estimates the probability that a sample came from the training data rather than from G. G is trained to maximize the probability of D making a mistake. Here, the GAN is regarded as a mapping from noisy speech to clean speech. It is an end-to-end structure with no feature extraction or resynthesis stage, which avoids the need for prior information and assumptions. The GAN-based system also has fewer steps and is easier to operate. To evaluate performance more thoroughly, an additional evaluation metric called mPESQ is considered, which covers signal distortion (SIG), intrusiveness of background noise (BAK), and overall effect (OVL). The experimental results show that SEGAN obtains a slightly worse PESQ than a simple mask-based DNN but performs better on STOI and segSNR. On mPESQ, which correlates better with speech and noise distortion, SEGAN outperforms the mask-based DNN method: it produces less speech distortion (SIG) and removes noise more effectively (BAK and segSNR), and therefore achieves a better tradeoff between the two factors (OVL). As for the activation functions, Leaky ReLU is slightly better than PReLU. On the whole, SEGAN performs similarly to the mask-based DNN but has a simpler structure with fewer assumptions. It is a new attempt in speech enhancement with good prospects for further development.
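
The mask decomposition described in the abstract (IRM factored by NMF into bases and weights, with the estimated mask formed as a linear combination of the bases) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the thesis's implementation: the IRM definition (square root of the speech-to-mixture power ratio), the random toy spectra, the rank of 8, the plain multiplicative-update NMF with per-frame spectral bases, and the helper names `nmf` and `reconstruct_mask`. In the described system a DNN would predict the weights from noisy features; the sketch simply reuses the NMF weights in its place.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative matrix V (freq x frames) as W @ H.

    Uses the classic multiplicative updates for the Frobenius-norm cost.
    W holds the mask bases (freq x rank), H the weights (rank x frames).
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, rank)) + eps
    H = rng.random((rank, n_frames)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def reconstruct_mask(bases, est_weights):
    """Linearly combine the mask bases with (estimated) weights, clipped to [0, 1]."""
    return np.clip(bases @ est_weights, 0.0, 1.0)

# Toy power spectra standing in for real speech and noise (assumption).
rng = np.random.default_rng(1)
speech_pow = rng.random((64, 100))
noise_pow = rng.random((64, 100))

# One common IRM definition (assumed here): sqrt(S / (S + N)), values in [0, 1].
irm = np.sqrt(speech_pow / (speech_pow + noise_pow + 1e-12))

# Decompose the IRM into bases and weights; the weights are the new DNN target.
bases, weights = nmf(irm, rank=8)

# Stand-in for DNN output: reuse the NMF weights to rebuild the mask.
est_mask = reconstruct_mask(bases, weights)
rel_err = np.linalg.norm(irm - est_mask) / np.linalg.norm(irm)
print(f"relative reconstruction error: {rel_err:.3f}")
```

Because the network target is the weight matrix (8 x frames in this toy setup) rather than the full mask (64 x frames), the output layer is smaller, and any predicted weights yield a mask that lies in the span of the learned bases.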
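
The adversarial training described in the abstract (D estimates the probability that a sample is real, and G is trained to make D err) matches the standard GAN minimax objective, written below for reference. The abstract does not state the exact loss used in the thesis, so this is the textbook form and the actual system may use a variant. Feeding the generator a noisy utterance $\tilde{x}$ instead of a random latent vector is an assumption that follows the abstract's description of the GAN as a mapping from noisy to clean speech.

```latex
\min_{G}\,\max_{D}\; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{clean}}}\!\left[\log D(x)\right]
  + \mathbb{E}_{\tilde{x} \sim p_{\text{noisy}}}\!\left[\log\bigl(1 - D(G(\tilde{x}))\bigr)\right]
```

Here $x$ is a clean speech sample, $G(\tilde{x})$ is the enhanced output, and $D(\cdot) \in (0, 1)$ is the discriminator's estimate that its input came from the clean training data.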
Keywords/Search Tags: speech enhancement, deep neural network, ideal ratio mask, non-negative matrix factorization, identity vector, transfer learning, generative adversarial network