
Research On Deep Neural Network Based Speech Enhancement

Posted on: 2016-10-20
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Xu
GTID: 1228330470458004
Subject: Signal and Information Processing
Abstract/Summary:
Speech enhancement is one of the most important branches of speech signal processing. Over the past several decades, many unsupervised methods have been proposed. Most of them first estimate the noise and then subtract the noise spectrum from the noisy spectrum. However, because noise is often non-stationary, tracking it is very difficult. In addition, because the relationship between noisy speech and clean speech is complicated, these methods rely on assumptions such as independence between signals and Gaussian distributions. These assumptions cause several problems. First, they leave considerable residual noise, including musical noise. Second, the details of the speech are damaged, especially at low signal-to-noise ratio (SNR). Third, highly non-stationary noise remains intractable: because it is abrupt, the noise spectrum is underestimated and the noise is difficult to remove from the noisy spectrum, yet such noise is very common in real-world conditions. Finally, traditional speech enhancement methods introduce non-linear distortion, which harms subsequent speech recognition or speech coding.

Recently, the great success of deep neural networks (DNNs) in speech recognition has inspired the use of deep models for speech enhancement. The deep structure of a DNN can be designed to act as a de-noising filter. Furthermore, a DNN can learn the relationship between noisy speech and clean speech from large amounts of data. In addition, because a DNN is trained off-line, it can remember noise patterns much as humans do, and can therefore suppress some non-stationary noise. In this dissertation, we propose a DNN-based method that makes none of these assumptions, and we conduct a series of studies on enhancement, including for real-world noisy speech.

First, we propose DNN-based speech enhancement. Log-power spectra (LPS) are used as the features, and the DNN serves as the mapping function that predicts clean speech from noisy speech. DNN training consists of two steps, pre-training and fine-tuning. Pre-training is based on restricted Boltzmann machines (RBMs) and proceeds layer by layer to avoid getting stuck in local minima. Fine-tuning then learns the non-linear relationship between noisy and clean speech accurately.

Second, as a supervised model, the DNN faces a mismatch problem between training and test conditions. Hundreds of noise types are used to train the model to improve its generalization capacity, and we find that the resulting model suppresses non-stationary noise well. Noise-aware training further improves its separation ability. Dropout is used during DNN training to avoid over-fitting. Global variance equalization is also proposed to improve speech quality.

Third, adaptation is needed to deal with mismatch, including energy mismatch, noise mismatch and language mismatch. We propose a mean shift method to address energy mismatch. Dynamic noise-aware training is proposed to alleviate noise mismatch: an ideal binary mask (IBM) is used to estimate the noise, which then serves as auxiliary information for the DNN. Language mismatch arises from the differences between languages, and a transfer learning method is proposed to address it.

Finally, the DNN is trained with a minimum mean squared error criterion in the LPS domain. This cost function is difficult to optimize directly, so an indirect method is proposed that jointly optimizes it with Mel-frequency cepstral coefficients (MFCCs). The MFCC information acts as a restriction term that helps the prediction of LPS. Category information, such as the IBM and clustered noise codes, is also adopted as a restriction term on the output of the DNN.

At the end of this dissertation, we summarize the work and outline plans for future research.
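
As a minimal illustration of the feature and training criterion described in the abstract, the following Python sketch computes log-power spectra for a single frame and the mean squared error between noisy and clean features. The frame length, the toy sine and noise signals, the floor constant and the function names are assumptions made for illustration only; none of them come from the dissertation. A real system would pass the noisy features through a trained DNN before computing the error.

```python
import cmath
import math


def log_power_spectrum(frame, floor=1e-10):
    """Log-power spectrum of one real-valued frame, using a naive DFT.

    Returns len(frame) // 2 + 1 values (the non-negative frequency bins).
    `floor` is an assumed small constant that avoids taking log(0).
    """
    n = len(frame)
    lps = []
    for k in range(n // 2 + 1):
        # k-th DFT coefficient of the frame
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(frame))
        lps.append(math.log(abs(coeff) ** 2 + floor))
    return lps


def mse(predicted, target):
    """Mean squared error between two equal-length feature vectors."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


# Toy 16-sample frame: a sine in DFT bin 2 stands in for "clean speech",
# and an alternating sequence (energy in the highest bin) stands in for noise.
clean = [math.sin(2 * math.pi * 2 * t / 16) for t in range(16)]
noise = [0.1 * (-1) ** t for t in range(16)]
noisy = [c + v for c, v in zip(clean, noise)]

clean_lps = log_power_spectrum(clean)
noisy_lps = log_power_spectrum(noisy)

# Error of the unprocessed noisy features against the clean target.
# An enhancement model would aim to produce features with a lower error.
baseline_error = mse(noisy_lps, clean_lps)
```

Here `baseline_error` is the LPS-domain error obtained when the noisy features are used unchanged as the prediction, which gives a reference point for any mapping function trained with this criterion.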
Keywords/Search Tags: speech enhancement, deep neural network, non-stationary noise, generalization capacity, ideal binary mask