
Single Channel Speech Enhancement And Separation

Posted on: 2022-03-20
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Rizwan Ullah
Full Text: PDF
GTID: 1488306323962899
Subject: Information and Communication Engineering
Abstract/Summary:
Communication is the key objective of speech: the transmission of information from the transmitter to the receiver. Speech signals are used extensively in speech signal processing systems such as intelligent conferencing, smartphones, and hearing aids, and in many other systems for human-to-human and human-to-machine communication. However, the intelligibility and quality of speech signals are often degraded by various environmental noises, so it is very important to enhance speech signals through speech enhancement algorithms and systems. Estimating the clean speech signal from a noisy signal in a single channel is both challenging and difficult, because of the limited information available in one channel and the non-stationary nature of noise: most noises in real environments are non-stationary and have speech-like characteristics. Modern speech communication systems therefore always need to estimate the desired clean speech from speech signal mixtures. The key purpose of a speech enhancement and separation framework is to estimate the desired signal from the mixture without significantly degrading its quality and intelligibility, in order to meet the requirements of the user.

Conventional supervised sparse non-negative matrix factorization (SNMF) involves training and testing stages. The training stage needs prior information and data for building the speech and noise dictionaries. In practical scenarios, access to clean noise data is not always possible. Furthermore, even when training data are available, the kind and nature of the noise in the observed mixture may differ from the training data; this degrades the performance of the speech enhancement algorithm and is one of the main limitations of supervised SNMF. The second problem is the dereverberation and separation of speech mixtures in a single channel, which has rarely been reported previously. There are individual papers on speech dereverberation and on speech separation, and some works address both in a single paper, but for multiple channels, and mostly with noise as the other source. With multiple channels, more information is available than with a single channel, and separating noise from speech is easier than separating a speech-speech mixture, owing to the distinct characteristics of speech and noise. The problem becomes challenging and difficult when only speech sources are present in the mixture, together with reverberation, in a single channel. The third problem is the time-frequency resolution limitation of the short-time Fourier transform (STFT) used in traditional speech enhancement methods. Furthermore, traditional methods enhance only the magnitude spectrum and then reuse the noisy phase to reconstruct the final estimated speech, so the quality and intelligibility of the enhanced speech are not very good. Our aim is to achieve the best possible form of the desired signal from a noisy mixture, with acceptable speech quality and intelligibility and minimal audible distortion.

In the first work, a novel semi-supervised single-channel speech enhancement algorithm is proposed that uses the optimally-modified log-spectral amplitude estimator (OMLSA), a voice activity detector (VAD), correlation, and sparse non-negative matrix factorization (SNMF). The algorithm first extracts noise from the original noisy speech signal using OMLSA. The extracted noise contains residual speech components that need to be removed. For this purpose, some clean speech is passed through the VAD to remove its silence zones, and the correlation between this VAD-processed clean speech and the noise extracted by OMLSA is computed. Based on the high correlation coefficient between the VAD-processed clean speech and the residual speech components, those residual components are detected and removed from the extracted noise, leaving the extracted noise in the purest possible form. The VAD step is necessary because some residual speech components may correlate with the silence zones of the clean speech and would then not be detected and properly removed. Evaluation shows that the algorithm significantly recovers objective speech quality and intelligibility, outperforming related unsupervised and supervised methods.

Secondly, a speech dereverberation and separation method is proposed based on robust principal component analysis (RPCA) and SNMF, addressing the rarely reported single-channel case described above, in which only speech sources and reverberation are present and the available information is limited. Contrary to many dereverberation methods that require prior training or information, RPCA is unsupervised: no training data or dictionary learning of any kind is used for dereverberation. The reverberant speech mixture is first dereverberated using RPCA, which suppresses the late reverberation that is mainly responsible for distorting the signal. The dereverberated speech mixture is then separated using sparse non-negative matrix factorization, and after initial estimates of the signals are obtained, masking is used to compute the final signal estimates. The method successfully separates and dereverberates reverberant speech mixtures and significantly increases the quality and intelligibility of the dereverberated and separated speech.

Thirdly, a novel single-channel speech enhancement algorithm is proposed that uses a double transformation composed of the dual-tree complex wavelet transform (DTCWT) and the STFT, and jointly learns the real, imaginary, and magnitude parts of the signal through a generative joint dictionary learning (GJDL) algorithm. The input signal is first decomposed by the DTCWT into a set of subband signals; the DTCWT overcomes the signal degradation produced by the downsampling of the discrete wavelet packet transform (DWPT). The second transformation is the STFT, which builds a complex spectrogram from each coefficient: applying the STFT to every subband signal yields the real part, imaginary part, and magnitude of each subband, while the phase is preserved for further processing. The GJDL method is used to build the joint dictionary, and batch least angle regression with a coherence criterion (LARC) is used for sparse coding. An initial estimate is obtained by combining the real and imaginary parts; a subband binary ratio mask (SBRM) is used to produce one signal, and the estimated enhanced magnitude combined with the preserved phase forms a second signal. Since the two signals acquired by these procedures have different accuracies, the Gini index is used to combine them into the final estimated clean speech signal. The proposed algorithm achieves the best performance among the compared algorithms on all considered evaluation metrics.
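The SNMF factorization that underlies the first two works can be sketched in a few lines. The sketch below is a minimal illustration, assuming a Euclidean cost with an L1 sparsity penalty on the activations and multiplicative updates; the abstract does not specify the exact cost function or update rule, and the split of basis vectors into a "speech" part and a "noise" part, as well as the Wiener-style soft mask, are likewise illustrative assumptions rather than the dissertation's exact method.

```python
import numpy as np

def sparse_nmf(V, rank, n_iter=300, lam=0.05, seed=0):
    """Factorize a nonnegative magnitude spectrogram V (freq x time) as W @ H.

    Multiplicative updates for a Euclidean cost with an L1 sparsity penalty
    (weight lam) on the activation matrix H -- one common way to realize
    sparse NMF; the dissertation may use a different cost or solver.
    """
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, rank)) + 1e-3   # dictionary of spectral bases
    H = rng.random((rank, T)) + 1e-3   # sparse activations
    eps = 1e-9
    for _ in range(n_iter):
        # updates keep W and H nonnegative because every factor is nonnegative
        H *= (W.T @ V) / (W.T @ W @ H + lam + eps)
        W *= (V @ H.T) / (W @ (H @ H.T) + eps)
    return W, H

# Toy demo on a synthetic low-rank stand-in for a magnitude spectrogram.
rng = np.random.default_rng(1)
V = rng.random((30, 4)) @ rng.random((4, 50))
W, H = sparse_nmf(V, rank=4)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)

# Wiener-style soft mask, pretending (for illustration) that the first two
# basis vectors form the "speech" dictionary and the rest the "noise" one.
speech_part = W[:, :2] @ H[:2, :]
mask = speech_part / (W @ H + 1e-9)   # entrywise values in [0, 1]
speech_est = mask * V
```

In the semi-supervised setting of the first work, the noise dictionary would be learned from the OMLSA-extracted, correlation-purified noise at test time rather than from pre-recorded noise training data.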
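The RPCA step of the second work decomposes an observation into a low-rank part plus a sparse part. The following is a generic principal component pursuit sketch on a synthetic matrix, solved with the inexact augmented Lagrange multiplier (IALM) method; it is not the dissertation's exact dereverberation pipeline, and the parameter choices (the standard `lam = 1/sqrt(max(m, n))`, the `mu` schedule) are conventional defaults assumed here for illustration.

```python
import numpy as np

def rpca(M, lam=None, n_iter=200, tol=1e-7):
    """Principal component pursuit via inexact ALM: M ~= L (low-rank) + S (sparse)."""
    m, n = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))  # standard choice in the RPCA literature
    normM = np.linalg.norm(M)
    mu = 1.25 / np.linalg.norm(M, 2)    # penalty parameter, grown each iteration
    rho = 1.5
    Y = np.zeros_like(M)                # Lagrange multiplier
    S = np.zeros_like(M)
    L = np.zeros_like(M)
    for _ in range(n_iter):
        # low-rank step: singular value thresholding
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
        # sparse step: entrywise soft thresholding
        R = M - L + Y / mu
        S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)
        resid = M - L - S
        Y += mu * resid
        mu *= rho
        if np.linalg.norm(resid) < tol * normM:
            break
    return L, S

# Toy demo: recover a rank-2 matrix corrupted by 5% large sparse errors.
rng = np.random.default_rng(0)
L0 = rng.standard_normal((40, 2)) @ rng.standard_normal((2, 40))
S0 = np.zeros((40, 40))
support = rng.random((40, 40)) < 0.05
S0[support] = 10.0 * rng.standard_normal(support.sum())
L, S = rpca(L0 + S0)
err_L = np.linalg.norm(L - L0) / np.linalg.norm(L0)
```

In the dereverberation setting, the low-rank component would model the slowly varying late reverberation to be suppressed, while the sparse component retains the direct speech; the mixture passed on to SNMF separation is taken from this decomposition.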
Keywords/Search Tags: Speech separation, Speech enhancement, Sparse non-negative matrix factorization, Robust principal component analysis, Dual-tree complex wavelet transform, Short-time Fourier transform, Joint dictionary learning, Gini index