We live in acoustically complex environments, and speech signals are inevitably corrupted by surrounding noise. Speech enhancement technology aims to suppress noise interference and improve the overall perceived quality and intelligibility of speech. It plays an important role in many applications, such as speech communication, speech recognition, and speaker verification. Single-channel speech enhancement algorithms fall into two broad categories: traditional algorithms and deep neural network-based algorithms. Traditional algorithms are grounded in statistics and signal processing, admit rigorous theoretical analysis, and are robust. However, to keep the computation tractable, they rely on simplifying assumptions that limit their ability to handle highly non-stationary noise. In recent years, deep neural network-based speech enhancement algorithms have made great progress and show clear advantages in handling highly non-stationary noise. However, they generalize poorly to unseen noise scenes and are less robust when training and test conditions do not match. This paper is dedicated to further improving single-channel speech enhancement, both by refining traditional algorithms and by combining traditional and deep learning methods:

(1) The Bayesian Minimum Mean Square Error (MMSE) noise Power Spectral Density (PSD) estimation method is investigated and analyzed. The performance of MMSE noise estimation depends mainly on the accuracy of the a priori SNR estimate, and any bias in that estimate inevitably propagates into the noise PSD estimate. Existing MMSE noise PSD methods focus on correcting this deviation with bias compensation, rather than controlling the bias at its source. Since the speech spectral power is central to the decision-directed (DD) a priori SNR estimator, this paper proposes an MMSE spectral power estimator incorporating speech presence uncertainty (SPU), together with a bias compensation factor, to obtain a more accurate speech spectral power estimate. This improves the accuracy of the DD estimator, controls the bias at its source, and further improves the handling of non-stationary noise.

(2) The Mean Square Error (MSE) distortion metric has the advantage of being mathematically tractable, and MMSE noise PSD methods built on it achieve good performance; however, the MSE metric is not perceptually meaningful. Researchers have demonstrated through extensive experiments that a distortion metric based on the MSE of the logarithmic spectra correlates better with auditory perception. In addition, existing noise PSD estimation methods based on time-recursive averaging often reduce the risk of speech leakage at the expense of noise update speed. In view of this, this paper combines a log-spectral power MMSE estimator with time-recursive averaging for noise PSD estimation. The noise PSD estimate is updated by recursively averaging the log-spectral MMSE estimate of the noise periodogram, which reduces speech leakage into the noise PSD estimate at the source while maintaining the noise tracking speed; a schematic sketch of this estimator family is given below.
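For concreteness, the following is a minimal NumPy sketch of the estimator family behind (1) and (2): a decision-directed a priori SNR estimate driving a spectral gain, followed by a time-recursive noise PSD update built on an MMSE estimate of the noise periodogram. It is illustrative only, not the thesis implementation: the plain Wiener gain stands in for the SPU-weighted MMSE spectral power estimator, the recursion runs on the linear rather than the log-spectral power for brevity, and the smoothing constants are typical textbook values.

```python
import numpy as np

def enhance_frame(Y, S_prev, noise_psd, alpha_dd=0.98, alpha_n=0.8):
    """One STFT frame of gain-based enhancement (illustrative sketch).

    Y         : complex STFT coefficients of the current noisy frame
    S_prev    : enhanced spectral amplitudes of the previous frame
    noise_psd : running noise PSD estimate from the previous frame
    """
    eps = 1e-12
    Y_pow = np.abs(Y) ** 2

    # A posteriori SNR: gamma = |Y|^2 / noise PSD.
    gamma = Y_pow / np.maximum(noise_psd, eps)

    # Decision-directed (DD) a priori SNR (Ephraim-Malah form):
    # xi = a * |S_prev|^2 / PSD_n + (1 - a) * max(gamma - 1, 0).
    xi = (alpha_dd * S_prev ** 2 / np.maximum(noise_psd, eps)
          + (1.0 - alpha_dd) * np.maximum(gamma - 1.0, 0.0))

    # Wiener gain as a simple stand-in for the MMSE/SPU spectral gain.
    G = xi / (1.0 + xi)
    S = G * np.abs(Y)

    # MMSE estimate of the noise periodogram, then time-recursive
    # (leaky) averaging -- the update scheme discussed in (2), shown
    # here in the linear power domain.
    noise_per = (1.0 - G) ** 2 * Y_pow + G * noise_psd
    noise_psd = alpha_n * noise_psd + (1.0 - alpha_n) * noise_per

    return S, noise_psd
```

Because the MMSE noise periodogram estimate already discounts bins dominated by speech (via the gain G), the recursive average can keep tracking the noise quickly without leaking speech energy into the noise PSD.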
(3) The combination of traditional and deep neural network speech enhancement algorithms is investigated. To address the limitations of traditional methods on highly non-stationary noise and the limited generalizability of deep learning methods, this paper proposes a speech enhancement framework that combines deep neural networks with traditional methods. The performance of traditional speech enhancement methods depends mainly on the accuracy of the a priori SNR estimate, yet existing estimators all assume that the noise changes more slowly than the speech, which limits their ability to deal with highly non-stationary noise. In view of this, this paper adopts a deep neural network-based a priori SNR estimation framework (Deep Xi-TCN). Deep Xi-TCN uses a temporal convolutional network (TCN) to estimate the a priori SNR without relying on such assumptions, shows excellent suppression of highly non-stationary noise, and performs robustly.

(4) The representation power of network models is studied, and to further improve the representational capability of the TCN model, this paper proposes a multi-branch TCN-based speech enhancement model (MB-TCN). MB-TCN exploits the split-transform-aggregate design idea: the input is split into several low-dimensional representations, each is transformed by a branch network, and the branch outputs are aggregated by concatenation. This design is what gives Inception-style models strong representation power at low computational complexity. In addition, MB-TCN incorporates one-dimensional causal dilated convolutions and residual learning to extend the receptive field and capture long-term temporal context (a minimal sketch of such a block is given at the end of this section). Under the robust Deep Xi speech enhancement framework, the proposed MB-TCN model outperforms multiple state-of-the-art deep learning-based speech enhancement methods on five widely used objective metrics.

In summary, this paper proposes novel solutions and optimization algorithms that address the shortcomings of both traditional and deep learning-based speech enhancement methods, improving the performance and robustness of existing single-channel speech enhancement algorithms. This research has important implications for the optimization and practical application of speech enhancement algorithms.
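To make the split-transform-aggregate idea concrete, the following is a minimal PyTorch sketch of one multi-branch TCN block with causal dilated convolutions and a residual connection. The branch count, bottleneck width, kernel size, and activations are placeholder assumptions for illustration, not the MB-TCN hyper-parameters from this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Conv1d):
    """1-D dilated convolution made causal by padding the past only."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation):
        self.left_pad = (kernel_size - 1) * dilation
        super().__init__(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):
        return super().forward(F.pad(x, (self.left_pad, 0)))

class MultiBranchTCNBlock(nn.Module):
    """Split-transform-aggregate block (illustrative sizes, not the paper's)."""
    def __init__(self, channels=256, branches=4, bottleneck=64,
                 kernel_size=3, dilation=1):
        super().__init__()
        # Split: each branch projects the input to a low-dimensional
        # representation and transforms it with a causal dilated conv.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(channels, bottleneck, 1),              # split
                nn.PReLU(),
                CausalConv1d(bottleneck, bottleneck,
                             kernel_size, dilation),             # transform
                nn.PReLU(),
            )
            for _ in range(branches)
        ])
        # Aggregate: concatenate branch outputs and project back.
        self.merge = nn.Conv1d(branches * bottleneck, channels, 1)

    def forward(self, x):                  # x: (batch, channels, time)
        y = torch.cat([b(x) for b in self.branches], dim=1)
        return x + self.merge(y)           # residual learning

# Blocks would typically be stacked with exponentially growing dilation
# (1, 2, 4, ...), e.g.:
# out = MultiBranchTCNBlock(dilation=4)(torch.randn(1, 256, 100))
```

Stacking such blocks with increasing dilation widens the temporal receptive field without increasing the per-layer cost, which is what lets a causal TCN exploit long-term context while remaining suitable for frame-by-frame enhancement.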