Speech Enhancement Based On Sparse Representation And Dictionary Learning

Posted on: 2016-07-03
Degree: Doctor
Type: Dissertation
Country: China
Candidate: G Z Bao
GTID: 1228330470457958
Subject: Signal and Information Processing
Abstract/Summary:
Speech is a very important carrier of information in human vocal communication. In real environments, however, speech signals are almost always contaminated and degraded by various kinds of interference and noise. Degraded speech not only causes listener annoyance and fatigue but also seriously reduces speech intelligibility. The goal of speech enhancement is to improve speech quality and intelligibility by suppressing or removing the interference and noise in the degraded speech. Depending on the type of contaminating source, speech enhancement can be divided into speech separation and speech denoising: in the former the contaminating source is interfering speech, and in the latter it is background noise. Traditional speech separation and speech denoising algorithms perform relatively well under certain conditions, but they also have limitations. For example, underdetermined speech separation, in which there are more source signals than mixed signals, remains a difficult problem, and traditional speech denoising algorithms suppress non-stationary noise very poorly. In this dissertation, we study these two problems with the help of sparse representation and dictionary learning theory and propose several speech separation and speech denoising algorithms. The main contributions and innovations are as follows.

First, we present an underdetermined speech separation algorithm based on a two-layer sparsity model, which involves two steps. In the first step, we select the time-frequency (TF) points that satisfy the W-disjoint orthogonality (WDO) assumption well. The selection principle is based on the singular value decomposition of the covariance matrix at each TF point of the mixed signal, where the covariance matrix is obtained by local averaging.
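As an illustration of the selection step described above, the following is a minimal sketch, not the dissertation's actual procedure. It assumes that a TF point is kept when its locally averaged covariance matrix is nearly rank one, i.e. when the largest singular value dominates. The function name, the averaging window (`half_width`), and the threshold (`ratio_threshold`) are hypothetical choices made only for this example.

```python
import numpy as np

def select_single_source_points(X, half_width=1, ratio_threshold=0.9):
    """X: complex STFT of the mixtures, shape (channels, freqs, frames).

    Returns a boolean mask of shape (freqs, frames) that is True where the
    locally averaged covariance matrix is close to rank one (assumed criterion).
    """
    n_frames = X.shape[2]
    # Outer product x x^H at every TF point: shape (freqs, frames, ch, ch)
    outer = np.einsum('ift,jft->ftij', X, X.conj())
    # Local average over neighbouring frames; the edges reuse the border frame
    pad = ((0, 0), (half_width, half_width), (0, 0), (0, 0))
    padded = np.pad(outer, pad, mode='edge')
    cov = np.zeros_like(outer)
    for shift in range(2 * half_width + 1):
        cov += padded[:, shift:shift + n_frames]
    cov /= 2 * half_width + 1
    # Singular values of each covariance matrix, in descending order
    s = np.linalg.svd(cov, compute_uv=False)
    dominance = s[..., 0] / np.maximum(s.sum(axis=-1), 1e-12)
    return dominance >= ratio_threshold
```

When a single source is active in the neighbourhood of a TF point, every mixture vector there is a scaled copy of one mixing-matrix column, so the averaged covariance has one dominant singular value and the point passes the test.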
The selected TF points are then clustered to obtain a more accurate estimate of the mixing matrix. In the second step, considering that most of the energy of a speech signal is concentrated at low frequencies, we propose a two-layer sparsity model that decomposes a speech signal into two layers, i.e., the low and high frequency bands. A dictionary based on the two-layer sparsity model reduces the overlap between the support sets of different source signals' projections onto the dictionary, which improves separation performance. In simulation experiments, the proposed mixing matrix estimation algorithm and the two-layer sparsity model based speech separation algorithm are compared with conventional methods to verify their effectiveness.

Second, we present two single-channel speech separation algorithms, one based on discriminative dictionary learning and one based on hierarchical dictionary learning. Conventional approaches learn the sub-dictionary for each source independently, without constraining the relationship between different sub-dictionaries. When a source signal is sparsely represented over the composite dictionary, some of its components are projected onto a non-corresponding sub-dictionary; we call this projection confusion. In other words, the dictionary is not discriminative enough, which leads to poor separation performance. We therefore propose a discriminative dictionary learning algorithm that takes the relationship between sub-dictionaries into account: each source signal is encouraged to be sparsely represented by its corresponding sub-dictionary, while its representation over the non-corresponding sub-dictionaries is suppressed.
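To make the notion of projection confusion concrete, the sketch below sparse-codes a signal over a composite dictionary and measures how much of the reconstruction energy comes from the non-corresponding sub-dictionary. It is only an illustration under stated assumptions: the sparse coder is a plain orthogonal matching pursuit (the abstract does not say which coder is used), the atoms are assumed to have unit norm, and `omp` and `projection_confusion` are hypothetical names.

```python
import numpy as np

def omp(D, x, n_nonzero):
    """Greedy orthogonal matching pursuit with at most n_nonzero atoms of D."""
    residual = x.astype(float).copy()
    support = []
    coeffs = np.zeros(D.shape[1])
    for _ in range(n_nonzero):
        # Pick the atom most correlated with the current residual
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k in support:
            break
        support.append(k)
        # Re-fit all selected atoms jointly by least squares
        sol, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ sol
        if np.linalg.norm(residual) < 1e-10:
            break
    if support:
        coeffs[support] = sol
    return coeffs

def projection_confusion(D_own, D_other, x, n_nonzero=3):
    """Share of reconstruction energy carried by the non-corresponding
    sub-dictionary when x is coded over the composite dictionary."""
    D = np.hstack([D_own, D_other])
    c = omp(D, x, n_nonzero)
    own = D_own @ c[:D_own.shape[1]]
    other = D_other @ c[D_own.shape[1]:]
    e_own, e_other = np.sum(own ** 2), np.sum(other ** 2)
    return e_other / max(e_own + e_other, 1e-12)
```

A value near 0 means the signal is represented almost entirely by its own sub-dictionary; a discriminative learning scheme of the kind described above would aim to push this value down for training signals of every source.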
Furthermore, because single-layer discriminative dictionary learning still leaves some projection confusion, we present a hierarchical dictionary learning method that constrains the dictionary at multiple layers in order to reduce the projection confusion and increase the discriminative power of the dictionary. To verify the advantage of these algorithms, simulation experiments compare the speech separation performance of the proposed discriminative and hierarchical dictionary learning methods with that of traditional methods.

Finally, we put forward a joint dictionary learning algorithm that learns the speech and noise dictionaries jointly, and a signal-feature dictionary learning algorithm that learns signal and feature dictionaries. Traditional methods learn the speech dictionary and the noise dictionary independently and then sparsely represent the mixture of speech and noise over the composite dictionary to achieve noise reduction. These approaches suffer from relatively serious source confusion, i.e., some speech components are represented by noise dictionary atoms and vice versa. To increase the discrimination between the speech and noise dictionaries, we use the noisy and clean signals to learn the two dictionaries jointly, minimizing both the approximation error of each source over its corresponding dictionary and the coherence of the dictionaries. The learned dictionaries further reduce source confusion at the enhancement stage. In addition, to exploit the neighborhood information of every TF point of the speech and noise signals, we present a new feature that incorporates neighbor weight information, and the signals and features of speech and noise are used to learn the signal and feature dictionaries jointly.
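The joint learning criterion mentioned above combines approximation error with dictionary coherence. The abstract does not give the exact objective, so the sketch below shows only one common way such a criterion could be written, assuming the coherence term is the cross-coherence between the speech and noise dictionaries. The function names and the trade-off weight `lam` are hypothetical.

```python
import numpy as np

def mutual_coherence(D_speech, D_noise):
    """Largest absolute normalised inner product between a speech atom
    and a noise atom (columns are atoms)."""
    Ds = D_speech / np.linalg.norm(D_speech, axis=0, keepdims=True)
    Dn = D_noise / np.linalg.norm(D_noise, axis=0, keepdims=True)
    return float(np.max(np.abs(Ds.T @ Dn)))

def joint_objective(S, N, D_speech, D_noise, C_speech, C_noise, lam=0.1):
    """Assumed form of a joint criterion: approximation error of each source
    over its own dictionary plus a penalty on cross-dictionary correlation."""
    fit = np.sum((S - D_speech @ C_speech) ** 2) \
        + np.sum((N - D_noise @ C_noise) ** 2)
    coherence_penalty = np.sum((D_speech.T @ D_noise) ** 2)
    return float(fit + lam * coherence_penalty)
```

Lower cross-coherence means speech atoms and noise atoms point in more distinct directions, which is what makes it harder for speech components to be absorbed by noise atoms and vice versa.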
The signal and feature dictionaries are linked by sharing the same sparse representation coefficients. At the enhancement stage, the signal dictionary is used to obtain one estimate of the speech directly, and the feature dictionary together with the TF masking technique is used to obtain another. The two estimates are weighted and combined to synthesize the final enhanced speech. In simulations, the ability of the proposed joint dictionary learning and signal-feature dictionary learning methods to suppress non-stationary noise is compared with that of the traditional method.
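The final combination step can be sketched as follows. This is a hedged illustration rather than the dissertation's method: the abstract specifies neither the mask type nor the weights, so the example assumes a Wiener-like ratio mask built from the feature-dictionary estimates and a single scalar `weight`. All argument names are hypothetical, and all inputs are taken to be magnitude spectrograms of the same shape.

```python
import numpy as np

def combine_estimates(Y_mag, S_direct, S_feat, N_feat, weight=0.5, eps=1e-12):
    """Y_mag:    magnitude spectrogram of the noisy speech
    S_direct: speech estimate obtained directly from the signal dictionary
    S_feat:   speech estimate derived from the feature dictionary
    N_feat:   noise estimate derived from the feature dictionary
    """
    # Wiener-like ratio mask in [0, 1] (assumed mask type)
    mask = S_feat ** 2 / (S_feat ** 2 + N_feat ** 2 + eps)
    S_masked = mask * Y_mag
    # Weighted sum of the two speech estimates
    return weight * S_direct + (1.0 - weight) * S_masked
```

With `weight=1.0` the output reduces to the direct estimate, and with `weight=0.0` it reduces to the masked estimate, so the weight controls how much each path contributes.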
Keywords/Search Tags: Speech enhancement, speech separation, speech denoising, sparse representation, dictionary learning, two-layer sparsity model, TF masking, discriminative dictionary learning, hierarchical dictionary learning, joint dictionary learning