
Research On Design Of Objective Function For Deep Neural Network Based Speech Enhancement

Posted on: 2022-10-22    Degree: Doctor    Type: Dissertation
Country: China    Candidate: L Chai    Full Text: PDF
GTID: 1488306323963059    Subject: Information and Communication Engineering
Abstract/Summary:
Speech is often corrupted by background noise during transmission, which seriously pollutes the speech, greatly degrades its quality and intelligibility, and impairs subsequent speech applications. Speech enhancement refers to techniques that extract the target speech from noisy speech while suppressing the noise. As a front-end speech signal processing technique, it serves back-end speech applications, and its objective differs with the application. For speech communication, the objective of speech enhancement is to improve speech quality and intelligibility; for speech recognition, it is to convert noisy speech into features of the recognition system that are insensitive to environmental distortion while still carrying a sufficient amount of discriminative information. With the breakthrough progress of deep neural networks (DNNs) in speech recognition, applying DNNs to speech enhancement has become an important research topic. DNN-based speech enhancement usually adopts a regression DNN to learn the complex nonlinear relationship between noisy speech and clean speech, and it achieves significantly better performance than traditional single-channel speech enhancement algorithms. From the perspective of machine learning, the difficulty of DNN-based speech enhancement lies in optimizing a complex, non-convex objective function. The minimum mean squared error (MMSE) is the objective function most commonly used in DNN-based speech enhancement models. However, it tends to cause serious over-smoothing and is inconsistent with the objectives of subsequent speech applications, which limits the upper bound of enhancement performance. The objective function is crucial to DNN training: under the same training conditions, a better objective function trains a better DNN model. In addition, intelligent speech technology is now successfully applied in various intelligent hardware products to realize human-computer interaction, and customization and personalization are new directions in the development of these smart products. This dissertation mainly studies customized objective functions for DNN-based speech enhancement targeted at different back-end speech applications or specific acoustic scenarios.

Firstly, we improved the MMSE objective function. From a statistical perspective, the MMSE criterion can be regarded as the maximum likelihood solution under an assumed independent, normally distributed, and homoscedastic error model. However, our statistical analysis of the per-dimension prediction errors of a DNN-based speech enhancement model shows that they follow leptokurtic distributions and that the variance differs across dimensions. Accordingly, we proposed to model the prediction errors in each dimension independently with a generalized Gaussian distribution (GGD). The log-likelihood function was then derived under this probabilistic framework and used as a new objective function for DNN-based speech enhancement, and maximum likelihood estimation was introduced to optimize the DNN and GGD parameters jointly. Experiments demonstrated that the GGD-based maximum likelihood objective function outperforms the conventional MMSE criterion.

Furthermore, deeper statistical analysis of the per-dimension prediction errors shows that they also follow asymmetric distributions. We therefore proposed to model the prediction errors in each dimension independently with an asymmetric Laplace distribution (ALD), and again derived the log-likelihood function as a new objective function. We analyzed the introduced asymmetry parameter from both experimental and theoretical perspectives and found that it can control the optimization direction of the speech enhancement network, which provides a feasible way to customize front-end enhancement algorithms for back-end speech applications.
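Below is a minimal PyTorch sketch, under illustrative assumptions rather than the dissertation's actual code, of what such per-dimension maximum-likelihood objectives could look like; the tensor shapes, parameter names, and the particular ALD parameterization are assumptions.

```python
import torch

def ggd_nll(err, alpha, beta):
    # err:   per-dimension prediction errors, shape (batch, dims)
    # alpha: per-dimension scale (positive), shape (dims,)
    # beta:  per-dimension shape (positive), shape (dims,);
    #        beta = 2 recovers the Gaussian/MMSE case, beta = 1 the Laplacian
    nll = (torch.log(2.0 * alpha) + torch.lgamma(1.0 / beta)
           - torch.log(beta) + (err.abs() / alpha) ** beta)
    return nll.sum(dim=-1).mean()

def ald_nll(err, lam, kappa):
    # lam:   per-dimension scale (positive); kappa: per-dimension asymmetry
    #        (positive). kappa = 1 gives the symmetric Laplacian, while
    #        kappa != 1 penalizes positive and negative errors differently,
    #        steering the optimization direction of the enhancement network.
    s = torch.sign(err)
    nll = (torch.log(kappa + 1.0 / kappa) - torch.log(lam)
           + err.abs() * lam * kappa ** s)
    return nll.sum(dim=-1).mean()
```

In practice the distribution parameters (alpha, beta, lam, kappa) could be registered as trainable parameters, for instance through a log-parameterization that keeps them positive, so that the network weights and the error-model parameters are optimized jointly by maximum likelihood as the abstract describes.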
Secondly, for speech recognition applications, we first proposed a new objective measure that uses time-synchronized clean and noisy speech to assess the recognition performance of speech enhancement algorithms. It is defined as the cross entropy between the state posterior probabilities that the DNN of a DNN-HMM acoustic model outputs for parallel noisy and clean speech. Experiments demonstrated that this measure correlates strongly with speech recognition performance. Because it is differentiable, it can easily replace the conventional MMSE criterion as the objective function for optimizing DNN-based speech enhancement toward higher recognition accuracy, thereby improving the noise robustness of the back-end speech recognition system.

Finally, for low-resource speaker-dependent speech enhancement, we proposed a Kullback-Leibler divergence (KLD) regularized objective function built on the maximum likelihood objective function. Specifically, the KLD between the conditional probability distributions of the speaker-independent model and the speaker-dependent model is added to the main objective function as a regularization term that constrains the speaker-dependent model from deviating too far from the speaker-independent model. This new objective function achieves a good adaptation of the speaker-independent model to the speaker-dependent model and alleviates the overfitting caused by insufficient speaker-specific data. A transfer learning strategy was also adopted to further reduce overfitting. In the end, less than one minute of speaker-specific clean data suffices to achieve better speech quality and intelligibility than a speaker-independent model trained on the multi-condition training corpus of a large data set.
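As a sketch of the state-posterior cross-entropy measure described above: the following assumes a hypothetical `acoustic_model` callable, the frozen DNN of a trained DNN-HMM system that maps acoustic features to state (senone) logits; everything else is likewise illustrative.

```python
import torch
import torch.nn.functional as F

def state_posterior_ce(enhanced_feats, clean_feats, acoustic_model):
    # Posteriors for the parallel clean speech serve as the reference and
    # carry no gradient; the acoustic model's own weights stay frozen.
    with torch.no_grad():
        ref = F.softmax(acoustic_model(clean_feats), dim=-1)
    log_post = F.log_softmax(acoustic_model(enhanced_feats), dim=-1)
    # Frame-level cross entropy between clean and enhanced state posteriors;
    # differentiable w.r.t. enhanced_feats, so it can be backpropagated
    # through the enhancement network in place of the MMSE criterion.
    return -(ref * log_post).sum(dim=-1).mean()
```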
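And for the KLD-regularized objective of the final part, a sketch under an illustrative assumption: each enhancement model outputs the mean of a fixed-variance Gaussian, so the KLD between the speaker-independent (SI) and speaker-dependent (SD) conditional distributions reduces to a squared distance between their outputs, with constants absorbed into the weight `rho`; the model and function names here are hypothetical.

```python
import torch

def kld_regularized_loss(noisy, clean, sd_model, si_model, nll_fn, rho=0.1):
    sd_out = sd_model(noisy)
    with torch.no_grad():
        si_out = si_model(noisy)   # the SI model is kept frozen
    main = nll_fn(sd_out - clean)  # maximum-likelihood term, e.g. ggd_nll above
    reg = ((sd_out - si_out) ** 2).sum(dim=-1).mean()  # Gaussian-KLD surrogate
    return main + rho * reg
```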
Keywords/Search Tags: speech enhancement, deep neural network, objective function, maximum likelihood estimation, objective evaluation measure, KLD regularization