
Research on Monaural Speech Enhancement Based on Prior Information at Different Semantic Levels

Posted on: 2021-01-27  Degree: Doctor  Type: Dissertation
Country: China  Candidate: Z H Du  Full Text: PDF
GTID: 1488306569484274  Subject: Computer Science and Technology
Abstract/Summary:
Speech is one of the most natural media for human-human communication and human-machine interaction, and it is widely used in mobile communication, web conferencing, and human-computer interfaces. In real life, speech signals are often corrupted by environmental noise, which can severely degrade both the perceptual quality for human listeners and the recognition accuracy of machines. Speech enhancement, an important technique for real-world applications, therefore aims to separate clean speech from background noise. After decades of development, great progress has been made in both monaural and multi-channel speech enhancement. Monaural speech enhancement, which separates clean speech from background noise using only single-channel noisy data, has the advantages of easy deployment, low demand on running devices, and high flexibility, and has become a hot research topic in recent years.

In general, monaural speech enhancement is an ill-posed problem that requires extra prior information to make a solution possible. Current speech enhancement methods typically train models to learn the time-frequency properties of speech signals only implicitly, by minimizing the reconstruction error alone. However, these methods do not effectively extract and utilize the prior information contained in clean speech data, which leads to signal distortion and spectrum over-smoothing, since no explicit prior constraints are imposed on the enhanced speech. This harms both perceptual quality and recognition accuracy. To address this problem, this thesis focuses on how to extract prior information at different semantic levels and apply it in the proposed monaural speech enhancement methods. The semantic levels studied here cover three aspects: the phoneme, spectrum, and signal levels. In the proposed methods, we first extract the prior information by modeling the probability distribution of different semantic units; the extracted information is then used to restrict the solution space of the enhanced speech. In this manner, the perceptual quality and recognition accuracy of the enhanced speech can be improved. The main research contents and contributions are summarized as follows:

(1) To utilize prior information at the phoneme level, we extract the semantic information contained in phoneme labels by modeling the posterior probability of phoneme categories. Two methods are proposed to model and utilize this posterior probability. In the first, an acoustic model predicts posterior probability grams (PPGs); the PPGs of noisy speech are then fed to the enhancement model as a condition, providing a more stationary clue. In the second, we design the phoneme-aware network (PAN), in which a PPG predictor and an enhancement model are trained jointly and iteratively, so that the enhanced features maximize the posterior probability of their corresponding phonemes (as sketched below). Finally, we propose the phoneme-aware-network-based enhancement method to utilize the semantic prior information of phoneme categories. Experimental results show that the recognition accuracy, speech intelligibility, and perceptual quality of enhanced speech are all improved by introducing this high-level prior information of phoneme categories.
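To illustrate how the phoneme-level prior enters training, the following minimal PyTorch sketch joins a PPG predictor and an enhancement model in one optimization step. The module names, layer sizes, and the loss weight alpha are illustrative assumptions, not the thesis's exact PAN architecture; the thesis also trains the two networks iteratively, whereas a single joint step is shown here for brevity.

```python
# Minimal sketch of joint training in a phoneme-aware network (PAN).
# Module names, sizes, and the loss weight `alpha` are illustrative
# assumptions, not the thesis's exact architecture.
import torch
import torch.nn as nn

N_MELS, N_PHONES = 80, 42  # assumed feature and phoneme-set sizes

class PPGPredictor(nn.Module):
    """Acoustic model: frame-wise phoneme posteriors (PPGs) from features."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(N_MELS, 256, batch_first=True)
        self.out = nn.Linear(256, N_PHONES)
    def forward(self, feats):                  # feats: (B, T, N_MELS)
        h, _ = self.rnn(feats)
        return self.out(h)                     # phoneme logits (B, T, N_PHONES)

class Enhancer(nn.Module):
    """Enhancement model conditioned on the noisy features' PPGs."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(N_MELS + N_PHONES, 256, batch_first=True)
        self.out = nn.Linear(256, N_MELS)
    def forward(self, noisy, ppg):
        h, _ = self.rnn(torch.cat([noisy, ppg.softmax(-1)], dim=-1))
        return self.out(h)                     # enhanced features

ppg_net, enh_net = PPGPredictor(), Enhancer()
opt = torch.optim.Adam(list(ppg_net.parameters()) + list(enh_net.parameters()))
ce, mse, alpha = nn.CrossEntropyLoss(), nn.MSELoss(), 0.1

def train_step(noisy, clean, phone_labels):   # phone_labels: (B, T) int64
    ppg = ppg_net(noisy)
    enhanced = enh_net(noisy, ppg)
    # The second term scores the *enhanced* features with the phoneme
    # classifier, so training explicitly maximizes the posterior
    # probability of the correct phonemes -- the phoneme-level prior.
    loss = (mse(enhanced, clean)
            + alpha * ce(ppg_net(enhanced).transpose(1, 2), phone_labels))
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```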
(2) To utilize prior information at the spectrum level, we extract the time-frequency (T-F) structures of clean speech by modeling the prior probability of clean spectrum segments. Through an adversarial process between a generator and a discriminator, the prior distribution of spectrum segments is modeled by the discriminator. In a second adversarial process, between the enhancement model and the discriminator, the enhanced spectra are fed to the discriminator to judge whether they resemble clean ones. Through this double adversarial process, the discriminator learns the T-F structures of clean spectra and is then used to train the enhancement model so that its outputs satisfy the same T-F structures (see the first sketch below). Finally, we propose the double-adversarial-network-based enhancement method to utilize the prior information of T-F structures contained in clean speech spectra. Experimental results show that recognition accuracy can be much improved by introducing this middle-level prior information of the speech spectrum.

(3) As for prior information at the signal level, our research covers two aspects. In the first, we extract the temporal correlation between samples by modeling the probability distribution of speech waveforms through invertible functions. Specifically, we employ a normalizing-flow network to model the conditional probability of speech waveforms given the corresponding acoustic features, trained by directly maximizing the log-likelihood (see the second sketch below). Meanwhile, a denoising autoencoder is trained to reconstruct clean features from noisy ones by minimizing the reconstruction error. The recovered features are then fed to the well-trained normalizing-flow network to obtain the enhanced speech waveforms. Finally, we propose the invertible-flow-network-based enhancement method to utilize the prior information of temporal correlation contained in speech signals. To reduce the mismatch between the recovered features and those expected by the invertible flow network, we further propose a joint training framework in which the two models are stacked and jointly fine-tuned. Experimental results show that speech intelligibility and perceptual quality can be further improved by introducing this low-level prior information of temporal correlation between speech samples.

(4) In the second aspect of utilizing signal-level prior information, we extract the temporal dependence between samples by modeling the probability distribution of speech waveforms with autoregressive models. Specifically, the joint probability of all samples is represented as the product of the conditional probabilities of each sample given its predecessors, yielding the probability distribution of the whole waveform (see the third sketch below). Meanwhile, a large-scale dataset is employed to improve speaker generalization. Finally, we propose the autoregressive-network-based enhancement method to utilize the prior information of temporal dependence contained in speech signals. In addition, we propose the self-supervised adversarial multi-task learning method (SAMLE), in which a self-supervised noise classifier is involved to reduce the impact of noise on the intermediate representations of the denoising autoencoder. Experimental results show that higher speech intelligibility and perceptual quality of enhanced speech are obtained by incorporating this prior information of temporal dependence between waveform samples. Compared with the invertible-flow-network-based enhancement method, the autoregressive network and SAMLE further improve the speaker and noise generalization ability of the enhancement model, respectively.
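The first sketch illustrates the double adversarial process of contribution (2). The fully connected networks, fixed-size magnitude segments, and non-saturating BCE losses are all illustrative assumptions; the thesis's exact architectures are not specified here.

```python
# Minimal sketch of the double adversarial process: in one game the
# discriminator D learns the prior distribution of clean spectrum
# segments against a generator G; in a second game the same D scores the
# enhancer E's outputs, pushing them toward clean T-F structure.
import torch
import torch.nn as nn

F_BINS, SEG, Z = 257, 8, 64   # assumed spectrum bins, segment frames, noise dim

D = nn.Sequential(nn.Flatten(), nn.Linear(F_BINS * SEG, 256),
                  nn.LeakyReLU(0.2), nn.Linear(256, 1))   # segment prior model
G = nn.Sequential(nn.Linear(Z, F_BINS * SEG))             # segment generator
E = nn.Sequential(nn.Linear(F_BINS * SEG, 512), nn.ReLU(),
                  nn.Linear(512, F_BINS * SEG))           # enhancement model

bce, mse = nn.BCEWithLogitsLoss(), nn.MSELoss()
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_ge = torch.optim.Adam(list(G.parameters()) + list(E.parameters()), lr=2e-4)

def train_step(clean, noisy, z):
    # clean/noisy: (B, SEG, F_BINS) magnitude segments; z: (B, Z) noise
    clean_f, noisy_f = clean.flatten(1), noisy.flatten(1)
    # Discriminator: clean segments are real; generated and enhanced ones
    # are fake, so D absorbs the T-F structure of clean spectra.
    real, fake = D(clean_f), D(G(z).detach())
    enh = D(E(noisy_f).detach())
    d_loss = (bce(real, torch.ones_like(real))
              + bce(fake, torch.zeros_like(fake))
              + bce(enh, torch.zeros_like(enh)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator and enhancer try to fool D; E keeps a reconstruction term.
    g_score, e_out = D(G(z)), E(noisy_f)
    e_score = D(e_out)
    ge_loss = (bce(g_score, torch.ones_like(g_score))
               + bce(e_score, torch.ones_like(e_score))
               + mse(e_out, clean_f))
    opt_ge.zero_grad(); ge_loss.backward(); opt_ge.step()
    return d_loss.item(), ge_loss.item()
```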
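The second sketch, for contribution (3), shows a single conditional affine coupling step and the exact log-likelihood that invertibility permits via the change-of-variables formula. A real flow stacks many such steps; the layer sizes and two-way split here are assumptions.

```python
# Minimal sketch of a conditional normalizing-flow step trained by
# direct maximum likelihood. Sizes and the single coupling step are
# illustrative; the thesis's exact flow network is not shown.
import math
import torch
import torch.nn as nn

class ConditionalCoupling(nn.Module):
    """Invertible affine coupling: half the chunk is affinely transformed
    with scale/shift predicted from the other half plus the acoustic-
    feature condition. A real model stacks many such steps."""
    def __init__(self, half, cond):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(half + cond, 256), nn.ReLU(),
                                 nn.Linear(256, 2 * half))
    def forward(self, x, c):                     # x: (B, 2*half), c: (B, cond)
        xa, xb = x.chunk(2, dim=-1)
        log_s, t = self.net(torch.cat([xa, c], dim=-1)).chunk(2, dim=-1)
        zb = xb * log_s.exp() + t                # invertible given xa and c
        return torch.cat([xa, zb], dim=-1), log_s.sum(-1)  # z, log|det J|

def nll(flow, x, c):
    """Exact negative log-likelihood via change of variables:
    log p(x|c) = log N(z; 0, I) + log|det dz/dx|."""
    z, logdet = flow(x, c)
    log_pz = -0.5 * (z ** 2).sum(-1) - 0.5 * z.shape[-1] * math.log(2 * math.pi)
    return -(log_pz + logdet).mean()

flow = ConditionalCoupling(half=64, cond=80)     # assumed sizes
x = torch.randn(4, 128)                          # waveform chunks
c = torch.randn(4, 80)                           # recovered acoustic features
nll(flow, x, c).backward()                       # maximize likelihood directly
```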
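The third sketch, for contribution (4), makes the autoregressive factorization p(x) = prod_t p(x_t | x_&lt;t) concrete: summing per-sample cross-entropies gives the negative log of the product of conditionals, i.e. the joint negative log-likelihood of the waveform. The GRU over quantized samples is a WaveRNN-style assumption; the thesis's exact model and the SAMLE noise classifier are not shown.

```python
# Minimal sketch of the autoregressive signal prior:
# p(x_1..x_T) = prod_t p(x_t | x_1..x_{t-1}).
import torch
import torch.nn as nn

Q = 256  # assumed mu-law quantization levels

class ARPrior(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(Q, 64)
        self.rnn = nn.GRU(64, 256, batch_first=True)
        self.out = nn.Linear(256, Q)
    def forward(self, x):                        # x: (B, T) quantized samples
        h, _ = self.rnn(self.emb(x[:, :-1]))     # condition only on the past
        return self.out(h)                       # logits for x_1..x_{T-1}

model = ARPrior()
ce = nn.CrossEntropyLoss()
x = torch.randint(0, Q, (8, 1000))               # dummy waveform batch
logits = model(x)
# Cross-entropy over all steps = joint negative log-likelihood, since the
# log of the product of conditionals is the sum of their logs.
loss = ce(logits.transpose(1, 2), x[:, 1:])
loss.backward()
```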
Keywords/Search Tags: Monaural speech enhancement, prior information, semantic levels, probability distribution, neural network