
Research On Sparse Representations And Deep Learning Based Supervised Speech Enhancement

Posted on: 2021-02-23
Degree: Master
Type: Thesis
Country: China
Candidate: Y Y Zhu
Full Text: PDF
GTID: 2428330602498960
Subject: Information and Communication Engineering
Abstract/Summary:
As a carrier of the linguistic sign system, speech is one of the most important media through which humans express their thoughts and emotions. However, the interference and noise that are ubiquitous in real production and life often contaminate speech signals, reducing speech quality and intelligibility; the degradation is also unpleasant to the human ear and poses challenges for back-end applications such as speech recognition. Speech enhancement, the technique of suppressing or eliminating the interference and noise in a degraded speech signal so as to improve its quality and intelligibility, is therefore regarded as a significant problem in speech signal processing.

Since the 1970s, researchers have worked on monaural speech enhancement and proposed traditional algorithms such as spectral subtraction, statistically optimal methods, and subspace methods. These algorithms rely heavily on assumptions about the characteristics of the speech signal and the noise and about the relationships between them, so their enhancement performance is limited. Under non-stationary noise in particular, they tend to introduce nonlinear distortions that harm auditory quality as well as back-end speech recognition and speech coding.

In recent years, as collecting speech data has become faster and more convenient, data-driven speech enhancement algorithms have been proposed. Their main idea is to let a model learn the features and properties of the training data without relying on presuppositions, which makes them applicable to complex acoustic environments. Against this background, this dissertation applies sparse representation theory and deep learning to the monaural speech enhancement problem.

First, we consider enhancing speech in which two types of non-stationary noise coexist, using a complementary joint sparse representation (CJSR) method. In the dictionary learning stage, joint dictionary learning is constrained by the mapping relationships between noisy speech and clean speech and between noisy speech and noise, so the learned dictionaries capture not only the spectral structures of the speech and the noise but also the relationships among the signals. This enriches the completeness and discriminability of the dictionaries, alleviating the source confusion and source distortion problems. In the enhancement stage, because the estimates obtained from the different sparse representations vary in reliability across conditions, we propose weighting parameters based on the residuals of the estimated signals and use them to combine the estimates.
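To make the combination rule concrete, the following is a minimal NumPy sketch of one plausible residual-based weighting, assuming each candidate estimate is scored by how well it explains the noisy observation; the function and variable names are illustrative, not the exact formulation used in the thesis.

```python
import numpy as np

def combine_estimates(noisy, estimates, eps=1e-12):
    """Hypothetical residual-based fusion of candidate speech estimates.

    Each estimate (e.g., one reconstructed from the speech dictionary and
    one obtained by subtracting the reconstructed noise) is weighted by
    the inverse of its residual energy against the noisy observation, so
    the estimate that explains the mixture best contributes the most.
    """
    residuals = np.array([np.sum((noisy - est) ** 2) for est in estimates])
    weights = 1.0 / (residuals + eps)
    weights /= weights.sum()                 # normalize to sum to one
    return sum(w * est for w, est in zip(weights, estimates))
```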
Secondly, popular frequency-domain speech enhancement approaches generally exploit frequency-domain information such as short-time Fourier transform (STFT) magnitudes or the log-power spectrum, while the phase of the enhanced speech is simply taken from the noisy signal. The mismatch between the enhanced magnitude and the noisy phase is therefore likely to cause an inconsistent spectrogram. On the other hand, compared with time-domain samples, the value of a time-frequency bin represents the energy of the corresponding frequency component, and the underlying phones and harmonic structure of speech are more distinguishable from background noise in the time-frequency domain. Motivated by these observations, we propose a novel fully convolutional neural network (FCN), called FLGCNN, for end-to-end speech enhancement.

The proposed FLGCNN is mainly built on an encoder and a decoder, with an extra convolution-based STFT (CSTFT) layer added to the encoder and an inverse STFT (CISTFT) layer added to the decoder, so that frequency-domain information is introduced to help enhance the speech. Furthermore, the encoder and decoder are constructed from gated convolutional layers, which enlarge the receptive field and allow the model to better control the information passed through the hierarchy. A temporal convolutional module (TCM) is inserted between the encoder and the decoder to better model the long-term dependencies of the speech signal. Since the proposed end-to-end model performs speech enhancement in an utterance-wise manner, we also optimize it with different utterance-based objective functions to study the impact of the loss function on performance.
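As an illustration of how an STFT can be realized as a convolution inside a network, here is a minimal PyTorch sketch of a fixed-kernel analysis layer; the class name, window choice, and parameters are assumptions for illustration, not the thesis's actual CSTFT layer.

```python
import math
import torch
import torch.nn as nn

class ConvSTFT(nn.Module):
    """Sketch of a convolution-based STFT: a Conv1d whose kernels are
    frozen to the real and imaginary parts of a Hann-windowed Fourier
    basis, so magnitude spectra are computed inside the network.
    Hypothetical layer, not the thesis's exact CSTFT implementation.
    """
    def __init__(self, n_fft=512, hop=128):
        super().__init__()
        n_bins = n_fft // 2 + 1
        window = torch.hann_window(n_fft)
        k = torch.arange(n_bins, dtype=torch.float32).unsqueeze(1)  # freq index
        n = torch.arange(n_fft, dtype=torch.float32).unsqueeze(0)   # time index
        angle = 2.0 * math.pi * k * n / n_fft
        real = (torch.cos(angle) * window).unsqueeze(1)   # (n_bins, 1, n_fft)
        imag = (-torch.sin(angle) * window).unsqueeze(1)
        self.conv = nn.Conv1d(1, 2 * n_bins, n_fft, stride=hop, bias=False)
        self.conv.weight.data.copy_(torch.cat([real, imag], dim=0))
        self.conv.weight.requires_grad = False            # fixed analysis basis

    def forward(self, x):                 # x: (batch, 1, samples)
        spec = self.conv(x)               # (batch, 2*n_bins, frames)
        re, im = torch.chunk(spec, 2, dim=1)
        return torch.sqrt(re ** 2 + im ** 2 + 1e-8)       # magnitude
```

A matching CISTFT layer could presumably be built the same way with a transposed convolution and a synthesis window.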
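The abstract does not name the utterance-based objectives that were compared, so as one common example, here is a scale-invariant SNR (SI-SNR) loss sketched in PyTorch; treating it as one of the candidate losses is an assumption.

```python
import torch

def si_snr_loss(estimate, target, eps=1e-8):
    """Negative scale-invariant SNR, averaged over the batch; an example
    utterance-level objective, assumed rather than taken from the thesis.
    Inputs are (batch, samples) waveforms covering whole utterances.
    """
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)  # zero-mean
    target = target - target.mean(dim=-1, keepdim=True)
    dot = torch.sum(estimate * target, dim=-1, keepdim=True)
    s_target = dot * target / (target.pow(2).sum(dim=-1, keepdim=True) + eps)
    e_noise = estimate - s_target          # part of the estimate off-target
    si_snr = 10.0 * torch.log10(
        s_target.pow(2).sum(dim=-1) / (e_noise.pow(2).sum(dim=-1) + eps)
    )
    return -si_snr.mean()                  # minimize negative SI-SNR
```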
Keywords/Search Tags:Single-channel speech enhancement, Supervised learning, End-to-end speech enhancement, Sparse representations, Joint dictionary learning, Fully convolutional network