
Multi-Objective Learning and Ensembling for Deep Neural Network Based Speech Enhancement

Posted on: 2019-05-10
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Q Wang
GTID: 1318330545452466
Subject: Information and Communication Engineering
Abstract/Summary:
In recent years, with the growing capabilities of smart terminals and cloud computing, the way people communicate with computers has changed considerably. As the most important, most commonly used, and most convenient means of information exchange, speech has become an indispensable medium. In daily life and in military communications, however, speech is often corrupted by various kinds of noise. This noise not only degrades speech quality but also makes downstream tasks such as speech recognition and voice activity detection more difficult. The main goal of speech enhancement is to recover the original clean speech from the noisy speech in order to improve speech quality and intelligibility.

Traditional single-channel speech enhancement algorithms are mostly unsupervised. They generally make assumptions about the probability distributions of the speech and noise signals. They perform well under stationary noise but are less able to suppress non-stationary noise. In recent years, deep learning has made major breakthroughs in speech signal processing and has provided a supervised learning solution for speech enhancement. Deep neural network (DNN) based speech enhancement has been found to achieve a large performance improvement over traditional algorithms. This dissertation focused on using a regression DNN to map the complex nonlinear relationship between speech and noise, in order to improve speech intelligibility at low SNR and to suppress non-stationary noise. A multi-objective learning and ensembling architecture with a compact, low-latency design was then proposed, which is suitable for real-time applications. Finally, the parameters of a DNN based on time-frequency masking were optimized in the framework of maximum likelihood estimation.

Firstly, based on the existing DNN speech enhancement framework, the impact of different feature types on system performance was investigated, with the aim of improving speech intelligibility at low SNR. The learning behavior of the regression DNN was studied with different feature types, such as log-power spectra (LPS) and amplitude spectra (AS). To exploit the complementarity between feature types, feature concatenation at the input layer and post-processing with different learning objectives at the output layer were proposed to improve both speech quality and intelligibility.

Secondly, to address noise mismatch on wideband (16 kHz) speech data, an improved dynamic noise estimation method was proposed. It uses double absolute thresholds, smoothing strategies, and interpolation with static noise to make the estimated fullband noise more accurate. Then, using subband noise features, which reduce model complexity, together with the ideal ratio mask (IRM), which represents the speech presence probability, joint aware training was performed to improve the model's generalization to unseen noise types.

Thirdly, a DNN-based multi-objective learning and ensembling architecture for speech enhancement was proposed. It consists of two stages, multi-objective learning and multi-objective ensembling, and achieves better results with lower model complexity and lower latency, which makes it more suitable for real-time speech applications. In the multi-objective learning stage, a DNN was employed to predict an ensemble of three feature subsets, one each for LPS, Mel-frequency cepstral coefficients (MFCC), and Gammatone frequency cepstral coefficients (GFCC). Each subset consists of the corresponding features of the clean speech and of the dynamic noise, together with the corresponding IRM. In the multi-objective ensembling stage, the auxiliary information learned in the previous stage was used as network input together with the original noisy signal, and the clean speech and IRM information corresponding to LPS, MFCC, and GFCC were predicted simultaneously at the output layer. Finally, post-processing was performed using the outputs of both DNN stages. Because multiple objectives were used in the DNN learning process, the two-stage network could be made very compact to reduce model complexity, and it maintained good performance even at low latency.

Finally, in a probabilistic framework, the IRM was used as the learning objective of the DNN, the IRM prediction error was assumed to follow a generalized Gaussian distribution, and maximum likelihood estimation was adopted to optimize the DNN parameters. The distribution of the IRM prediction error was examined for different shape parameters of the generalized Gaussian distribution. With a properly chosen shape parameter, the maximum likelihood method achieved a significant improvement in all objective metrics over the minimum mean square error method, with less speech distortion and more speech information preserved in high-frequency bins.
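
The maximum likelihood criterion described in the last part of the abstract can be sketched as follows. This is a minimal illustration, assuming a zero-mean generalized Gaussian in its standard parameterization, p(e) = beta / (2 * alpha * Gamma(1/beta)) * exp(-(|e| / alpha)^beta), with scale alpha and shape beta. The parameterization, the function names, and the sample error values are assumptions made for illustration and are not taken from the dissertation. Under this assumption, beta = 2 corresponds to a Gaussian error model (equivalent to minimizing the mean square error) and beta = 1 to a Laplacian one.

```python
import math


def ggd_nll(errors, shape, scale):
    """Average negative log-likelihood of `errors` under a zero-mean
    generalized Gaussian with shape `shape` (beta) and scale `scale` (alpha)."""
    if shape <= 0 or scale <= 0:
        raise ValueError("shape and scale must be positive")
    # log of the normalizing constant: log(2 * alpha * Gamma(1/beta) / beta)
    log_norm = math.log(2.0 * scale) + math.lgamma(1.0 / shape) - math.log(shape)
    # data term: mean of (|e| / alpha) ** beta
    power_mean = sum((abs(e) / scale) ** shape for e in errors) / len(errors)
    return log_norm + power_mean


def ggd_ml_scale(errors, shape):
    """Closed-form ML estimate of the scale for a fixed shape:
    alpha ** beta = beta * mean(|e| ** beta).
    Returns 0.0 if every error is zero, which ggd_nll does not accept."""
    mean_power = sum(abs(e) ** shape for e in errors) / len(errors)
    return (shape * mean_power) ** (1.0 / shape)


# Hypothetical IRM prediction errors (predicted mask minus ideal mask).
errors = [0.02, -0.05, 0.10, -0.01, 0.30, -0.03]

for shape in (1.0, 1.5, 2.0):
    scale = ggd_ml_scale(errors, shape)
    print(f"shape={shape}: scale={scale:.4f}, nll={ggd_nll(errors, shape, scale):.4f}")
```

Comparing the average negative log-likelihood across shape values gives one simple way to judge which shape parameter fits a given set of prediction errors best; in DNN training the same expression would serve as the loss in place of the mean square error.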
Keywords/Search Tags: speech enhancement, deep neural network, multi-objective learning, multi-objective ensembling, compact and low-latency design, maximum likelihood estimation