
Study Of Signal-sparsity Based Algorithms For Speech Enhancement

Posted on: 2019-06-22
Degree: Doctor
Type: Dissertation
Country: China
Candidate: R J Tong
Full Text: PDF
GTID: 1318330542997986
Subject: Signal and Information Processing
Abstract/Summary:
Speech enhancement is an important research direction in speech signal processing. It is widely used in applications such as telecommunication, hearing aids, smart appliances, human-computer interaction, and intelligent conferencing systems. Speech enhancement algorithms generally exploit differences in the structural characteristics of clean speech and noise. Mathematical transforms map the observed noisy speech signal into a new domain in which the distinction between speech and noise becomes more apparent: the coefficients corresponding to clean speech are typically sparsely distributed, while those corresponding to noise are spread randomly, so a simple mathematical operation can separate the two. However, many problems remain that existing speech enhancement systems do not solve. For example, many algorithms assume the noise is approximately stationary, meaning its amplitude changes slowly relative to clean speech. When the noise violates this stationarity assumption, many speech enhancement algorithms suffer performance degradation and can even cause significant speech distortion. To address this, researchers place identical microphones in a fixed geometry to form a microphone array, and many multi-channel algorithms have been built on this basis. In addition, real scenes often contain reverberation and echo, which pose a severe challenge to multi-channel algorithms. In this thesis, we mainly exploit the sparsity of clean speech signals in different domains and propose several effective speech enhancement algorithms. The work can be summarized as follows.

First, we propose a robust time-frequency decomposition model to account for the sparse, nonstationary character of impulsive noise, which may have arbitrarily large amplitude and is randomly distributed in the time domain. The speech component is projected onto a discrete cosine transform (DCT) dictionary, and the noise component onto an identity (unitary-matrix) dictionary. By controlling the sparsity ratio of the two sets of projection coefficients and applying an improved orthogonal matching pursuit (OMP) algorithm, the sparse projection vectors of the two components are estimated and the clean speech component is reconstructed. Tuning the sparsity ratio and the reconstruction error balances speech distortion against residual noise to achieve the best auditory quality.

Second, we propose processing multi-channel audio streams in parallel against directional or non-directional noise in real environments. A rectangular window of fixed length and width slides smoothly over the multi-channel audio streams at a constant speed. At each instant, we apply linear transformations only to the rows and columns of the data matrix selected by the window, achieving collaborative space-time filtering. The temporal and spatial filtering matrices are updated iteratively under the minimum mean square error (MMSE) criterion: we fix the temporal filter and update the spatial filter, then fix the spatial filter and update the temporal filter. The whole process typically converges in two to three cycles, and we obtain the enhanced speech data for all channels at once.

Third, to make full use of the temporal and spatial information carried by the multi-channel observations, we segment the audio stream of each channel into frames and rearrange these frames into a matrix. We then stack the matrices of all audio streams into a third-order tensor and design three filters (an intra-frame filter, an inter-frame filter, and a spatial filter) that perform space-time collaborative filtering. The three filters are updated iteratively under the MMSE criterion until the process converges, again yielding the enhanced speech data in all channels at once.

Finally, building on the third-order tensor model above, we propose tensor decomposition for noise reduction. The observed noisy speech tensor is projected onto three kinds of orthogonal matrices: universal, supervised, and unsupervised basis matrices. The universal basis is a three-dimensional DCT basis matrix; the supervised basis can be trained from pre-provided clean speech data; the unsupervised basis is inferred automatically from the observed noisy speech tensor itself. The projection coefficients form a core tensor of the same size. By minimizing a statistical risk criterion, we design an optimal threshold and clear all entries of the core tensor below this threshold to achieve noise suppression.
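The first contribution above (speech projected onto a DCT dictionary, impulsive noise onto an identity dictionary, both recovered by greedy pursuit) can be illustrated with plain OMP over the union dictionary. This is only a minimal sketch: it uses textbook OMP, not the thesis's improved variant or its sparsity-ratio control, and the signal length, sparsity levels, and amplitudes are illustrative assumptions.

```python
import numpy as np
from scipy.fft import dct

def omp(D, x, k):
    """Greedy orthogonal matching pursuit: select up to k atoms of D to fit x."""
    residual = x.copy()
    support, coef = [], np.array([])
    for _ in range(k):
        idx = int(np.argmax(np.abs(D.T @ residual)))   # most correlated atom
        if idx not in support:
            support.append(idx)
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef            # re-fit on current support
    return support, coef

n = 256
rng = np.random.default_rng(0)
C = dct(np.eye(n), norm='ortho', axis=0)   # orthonormal DCT basis (columns are atoms)

# Toy "clean speech": 5 active DCT atoms with moderate amplitudes.
a = np.zeros(n)
a[rng.choice(n, 5, replace=False)] = rng.uniform(3, 6, 5) * rng.choice([-1, 1], 5)
clean = C @ a

# Impulsive noise: 8 large spikes, sparse in time (identity dictionary).
noise = np.zeros(n)
noise[rng.choice(n, 8, replace=False)] = rng.uniform(8, 12, 8) * rng.choice([-1, 1], 8)

x = clean + noise
D = np.hstack([C, np.eye(n)])              # union dictionary [DCT | I]

support, coef = omp(D, x, k=20)            # a few extra iterations beyond 5 + 8
speech_est = np.zeros(n)                   # keep only the DCT atoms (indices < n)
for s, c in zip(support, coef):
    if s < n:
        speech_est += C[:, s] * c
```

Because the DCT atoms and the time-domain spikes are highly incoherent, the greedy pursuit assigns the spikes to identity atoms and the tonal content to DCT atoms, so discarding the identity-atom part of the fit removes the impulsive noise.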
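The final contribution (projecting the noisy tensor onto an orthogonal basis and clearing sub-threshold core-tensor entries) can be sketched with the universal 3-D DCT basis alone. The tensor shape, noise level, and the Donoho-Johnstone-style universal threshold used here are assumptions standing in for the thesis's risk-minimizing optimal threshold and its supervised/unsupervised bases.

```python
import numpy as np
from scipy.fft import dctn, idctn

rng = np.random.default_rng(1)
shape = (16, 16, 4)    # toy tensor: samples-per-frame x frames x channels

# Toy "clean speech" tensor: sparse in the 3-D DCT domain (12 large entries).
core_clean = np.zeros(shape)
idx = tuple(rng.integers(0, s, 12) for s in shape)
core_clean[idx] = rng.uniform(5, 10, 12) * rng.choice([-1, 1], 12)
clean = idctn(core_clean, norm='ortho')

sigma = 0.1
noisy = clean + rng.normal(0, sigma, shape)    # dense low-level noise

core = dctn(noisy, norm='ortho')               # project onto the universal basis
# Universal threshold (an assumed stand-in for the thesis's optimal threshold).
tau = sigma * np.sqrt(2 * np.log(np.prod(shape)))
core[np.abs(core) < tau] = 0.0                 # clear sub-threshold core entries
denoised = idctn(core, norm='ortho')
```

Since the transform is orthonormal, the dense noise stays dense and small in the core tensor while the speech energy concentrates in a few large entries, so hard-thresholding the core suppresses the noise with little speech distortion.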
Keywords/Search Tags: Speech enhancement, sparse distribution, microphone array, multi-channel speech enhancement, time-frequency decomposition model, discrete cosine transform, orthogonal matching pursuit, space-time filtering, minimum mean square error, tensor decomposition