
Research On Speech Enhancement Algorithms Based On Deep Learning

Posted on: 2022-08-27    Degree: Master    Type: Thesis
Country: China    Candidate: F L Kong    Full Text: PDF
GTID: 2518306740996389    Subject: Signal and Information Processing
Abstract/Summary:
As an important branch of speech signal processing, speech enhancement has important applications in fields such as speech communication, hearing aids, and the front ends of automatic speech recognition (ASR) systems. Although previously proposed traditional single-channel speech enhancement methods are computationally simple, their noise reduction performance is poor, especially for non-stationary noise. Deep learning algorithms that have emerged in recent years have significantly improved the performance of single-channel speech enhancement. However, deep-learning-based speech enhancement models often fail to generalize effectively to real-world scenarios. Moreover, real-time noise reduction on mobile or wearable devices is an important application today, but computationally intensive deep learning models are difficult to deploy on such resource-constrained devices. Building on existing work, this thesis studies single-channel speech enhancement algorithms based on deep learning. While pursuing high performance, we aim to keep computational complexity and latency low enough to meet real-time requirements on terminal devices. The main work and innovations of this thesis are as follows:

(1) Firstly, the definition, classification, and research significance of speech enhancement are summarized, and the development history and research status of single-channel speech enhancement are reviewed. Three common traditional single-channel speech enhancement algorithms are then examined: spectral subtraction, Wiener filtering, and minimum mean square error (MMSE) estimation of the amplitude spectrum and log-amplitude spectrum. A Wiener filtering method based on the a priori signal-to-noise ratio (SNR) is used as the baseline algorithm for the experiments in this thesis (a minimal sketch of this baseline follows). Finally, feature extraction and training targets in supervised speech enhancement are introduced in detail, laying the foundation for the research in this thesis.
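To make the baseline in (1) concrete, here is a minimal sketch of a Wiener filter driven by the decision-directed a priori SNR estimate. The noise PSD is taken from the first few frames under the assumption that they are noise-only; the smoothing factor, frame settings, and function name are illustrative, not the thesis's configuration.

```python
import numpy as np
from scipy.signal import stft, istft

def wiener_enhance(noisy, fs, n_fft=512, alpha=0.98, noise_frames=6):
    """Decision-directed a-priori-SNR Wiener filter (single channel).

    Assumes the first `noise_frames` STFT frames are noise-only,
    a common simplification for a static noise PSD estimate.
    """
    _, _, Y = stft(noisy, fs, nperseg=n_fft)              # (freq, frames)
    noise_psd = np.mean(np.abs(Y[:, :noise_frames]) ** 2, axis=1)

    S = np.zeros_like(Y)
    prev_s_pow = np.zeros(Y.shape[0])                     # |S_hat|^2 of previous frame
    for l in range(Y.shape[1]):
        gamma = np.abs(Y[:, l]) ** 2 / (noise_psd + 1e-12)   # a posteriori SNR
        # decision-directed a priori SNR estimate
        xi = alpha * prev_s_pow / (noise_psd + 1e-12) \
             + (1 - alpha) * np.maximum(gamma - 1.0, 0.0)
        gain = xi / (1.0 + xi)                            # Wiener gain
        S[:, l] = gain * Y[:, l]
        prev_s_pow = np.abs(S[:, l]) ** 2
    _, enhanced = istft(S, fs, nperseg=n_fft)
    return enhanced
```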
(2) A new RNN structure called Attention LSTM is proposed by introducing self-attention into the LSTM. Attention LSTM replaces the input gate and the forget gate of the LSTM with a single attention gate. The attention gate determines how much of the cell state from the previous time step is retained, and it is computed only from that previous cell state; this is the essence of the self-attention mechanism in Attention LSTM (a sketch of such a cell is given after item (4)). Building on RNNoise, an RNN-based real-time single-channel speech enhancement model, a new ratio mask representation exploiting inter-channel correlation (ICC) is used as the training target. Experiments on the dataset provided by the ICASSP 2021 Deep Noise Suppression (DNS) Challenge show that RNNoise outperforms the Wiener filtering algorithm by a large margin, with an improvement of 0.2 in PESQ, and that the new ratio mask further improves enhancement performance. In addition, Attention LSTM achieves performance comparable to LSTM and GRU at lower complexity.

(3) A feature-independent convolution called spatially variant convolution is proposed. The core idea is to learn a different convolution kernel for each output feature dimension. To control the number of parameters more flexibly, a grouped spatially variant convolution is subsequently proposed, in which adjacent feature dimensions in the output feature maps share a convolution kernel (sketched after item (4)). Firstly, a complex-domain fully convolutional network based on U-Net, DCUNet, and its real-domain counterpart, DUNet, are trained with the complex ideal ratio mask and the real ideal ratio mask as targets, respectively. The experimental results show that the overall performance of DUNet is better than that of DCUNet. Then, on the basis of DUNet, the depth-wise separable convolution from MobileNets is adopted, which factorizes a standard convolution into a depth-wise convolution followed by a 1×1 (point-wise) convolution. The experimental results show that depth-wise separable convolution reduces computational complexity but causes a significant performance drop. Subsequently, depth-wise separable convolution and spatially variant convolution are combined; the experimental results show that combining these two convolution structures improves the performance of the network while maintaining low computational complexity. Finally, an attention mechanism is incorporated into DUNet by inserting an attention layer between the encoder and the decoder. The experimental results verify the effectiveness of integrating the U-Net architecture with attention.

(4) Two novel CRN-based single-channel speech enhancement models are proposed. One is a ratio-mask-based CRN (CRN-RM), which incorporates a GRU into DUNet (sketched below). The other is a CRN with an encoder-generator architecture (EG-CRN), which replaces the decoder in U-Net with a generator composed of recurrent layers and fully connected layers. The experimental results show that CRN-RM achieves significantly better performance than DUNet with fewer parameters, which verifies the effectiveness of the CRN architecture. Compared with a previously proposed CRN based on amplitude spectrum mapping (CRN-MM) and a CRN based on complex spectrum mapping (DCCRN), the overall performance of CRN-RM is higher than that of CRN-MM and lower than that of DCCRN, but its computational cost is only 1.5% of DCCRN's. Compared to CRN-RM, EG-CRN further reduces the computational complexity without significant performance loss.
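To illustrate item (2): a minimal sketch of an Attention LSTM cell as we read the abstract's description, with the input and forget gates merged into one attention gate computed from the previous cell state alone. The exact parameterization (weight shapes, how the candidate state is formed) is an assumption, since the abstract only states what the gate is computed from.

```python
import torch
import torch.nn as nn

class AttentionLSTMCell(nn.Module):
    """Sketch of an Attention LSTM cell: the input and forget gates are
    replaced by a single attention gate that sees only the previous cell
    state. Parameterization is assumed, not taken from the thesis."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.attn_gate = nn.Linear(hidden_size, hidden_size)   # input: c_{t-1} only
        self.candidate = nn.Linear(input_size + hidden_size, hidden_size)
        self.out_gate = nn.Linear(input_size + hidden_size, hidden_size)

    def forward(self, x, state):
        h_prev, c_prev = state
        a = torch.sigmoid(self.attn_gate(c_prev))              # attention gate
        xh = torch.cat([x, h_prev], dim=-1)
        c_tilde = torch.tanh(self.candidate(xh))               # candidate cell state
        c = a * c_prev + (1.0 - a) * c_tilde                   # coupled retain/write
        o = torch.sigmoid(self.out_gate(xh))
        h = o * torch.tanh(c)
        return h, (h, c)
```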
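To illustrate item (3): a sketch of grouped spatially variant convolution, in which each group of adjacent frequency bins learns its own kernel instead of sharing one kernel across all positions. The tensor layout (batch, channels, frequency, time), the grouping rule, and the `group_size` parameter are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedSpatiallyVariantConv(nn.Module):
    """Sketch of grouped spatially variant convolution: one kernel per
    group of `group_size` adjacent output frequency bins. Odd kernel
    sizes are assumed for same-size padding."""

    def __init__(self, in_ch, out_ch, freq_bins, kernel=(3, 3), group_size=4):
        super().__init__()
        self.kernel = kernel
        self.group_size = group_size
        self.n_groups = (freq_bins + group_size - 1) // group_size
        # one kernel per frequency group instead of one shared kernel
        self.weight = nn.Parameter(
            torch.randn(self.n_groups, out_ch, in_ch * kernel[0] * kernel[1]) * 0.02)
        self.bias = nn.Parameter(torch.zeros(self.n_groups, out_ch))

    def forward(self, x):                                  # x: (B, C_in, F, T)
        B, _, Fq, T = x.shape
        kF, kT = self.kernel
        pad = (kT // 2, kT // 2, kF // 2, kF // 2)         # pad time, then freq
        patches = F.unfold(F.pad(x, pad), (kF, kT))        # (B, C_in*kF*kT, F*T)
        patches = patches.view(B, -1, Fq, T)
        # kernel-group index for every output frequency bin
        gidx = torch.arange(Fq, device=x.device) // self.group_size
        w = self.weight[gidx]                              # (F, C_out, C_in*kF*kT)
        b = self.bias[gidx]                                # (F, C_out)
        # per-frequency matmul: out[b, :, f, t] = w[f] @ patches[b, :, f, t] + b[f]
        return torch.einsum('fok,bkft->boft', w, patches) \
               + b.t()[None, :, :, None]
```

With `group_size` equal to the number of frequency bins this reduces to an ordinary shared-kernel convolution; with `group_size=1` every output frequency has its own kernel, the ungrouped spatially variant case.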
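To illustrate item (4): a minimal CRN-RM-style sketch with a convolutional encoder, a GRU bottleneck over time, and a deconvolutional decoder that predicts a sigmoid ratio mask for the magnitude spectrogram. The layer count, channel widths, and the 161-bin default are illustrative assumptions, not the thesis configuration.

```python
import torch
import torch.nn as nn

class TinyCRN(nn.Module):
    """Sketch of a ratio-mask CRN: conv encoder along frequency,
    GRU over time at the bottleneck, transposed-conv decoder.
    Assumes an odd number of frequency bins (e.g. 161 for a 320-point FFT)."""

    def __init__(self, freq_bins=161):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(1, 16, (3, 1), (2, 1), (1, 0)), nn.ELU())
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, (3, 1), (2, 1), (1, 0)), nn.ELU())
        self.bott_freq = (((freq_bins - 1) // 2 + 1) - 1) // 2 + 1   # 161 -> 81 -> 41
        self.gru = nn.GRU(32 * self.bott_freq, 32 * self.bott_freq, batch_first=True)
        self.dec2 = nn.Sequential(
            nn.ConvTranspose2d(32, 16, (3, 1), (2, 1), (1, 0)), nn.ELU())
        self.dec1 = nn.ConvTranspose2d(16, 1, (3, 1), (2, 1), (1, 0))

    def forward(self, mag):                        # mag: (B, 1, F, T)
        B, _, _, T = mag.shape
        z = self.enc2(self.enc1(mag))              # (B, 32, F//4, T)
        seq = z.permute(0, 3, 1, 2).reshape(B, T, -1)
        seq, _ = self.gru(seq)                     # temporal modeling
        z = seq.reshape(B, T, 32, self.bott_freq).permute(0, 2, 3, 1)
        mask = torch.sigmoid(self.dec1(self.dec2(z)))   # ratio mask in [0, 1]
        return mask * mag                          # enhanced magnitude
```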
Keywords/Search Tags: single-channel speech enhancement, deep learning, recurrent neural network, U-Net, convolutional recurrent network, attention