
Extremely Low Signal-to-noise Ratio Speech Enhancement Method Based On Deep Learning

Posted on: 2022-08-10
Degree: Master
Type: Thesis
Country: China
Candidate: W Z Ye
Full Text: PDF
GTID: 2518306524480794
Subject: Software engineering
Abstract/Summary:
Speech recognition systems have become effective tools in people's work, study, and daily life. The voice assistant of the WeChat smartphone app, voice-controlled televisions in smart homes, voice-controlled in-car navigation systems, and automatic transcription systems are all scenarios in which they are applied. However, speech recognition systems still have many shortcomings that prevent them from being deployed well in the scenarios that need them. Acoustic conditions such as varying noise types, low signal-to-noise ratio, and speaker characteristics greatly affect recognition performance, so a good speech enhancement module can serve as an important front end of a speech recognition system. Mapping-based speech enhancement methods, however, often introduce speech distortion while denoising; unless the distorted speech is used to train a robust recognition model, it is difficult for a sensitive speech recognition system to adapt to the change.

To improve the performance of the speech enhancement model, suppress speech distortion, and improve the accuracy of the speech recognition system, this thesis proposes a speech enhancement model named DERE-Att that uses clean speech to help reconstruct speech. Because a neural network treats certain input data, such as word vectors and speech vectors, as points in a high-dimensional space, this thesis designs an embedding encoder-decoder (EED) as the basic structure of DERE-Att. The structure is derived from the ASAM model, which uses a data structure called an embedding array: a mapping from each time-frequency element of the amplitude spectrum to a high-dimensional space. ASAM does not make good use of the neighborhood information in the embedding array, a problem that EED solves well.

In DERE-Att, a memory extraction module is designed to extract feature information from clean speech. To prevent the clean speech spectrum from being mapped directly to the output, a randomly shuffled speech spectrum that differs from the target clean spectrum is used. An attention mechanism combines the noisy embedding array with the extracted memory to obtain a new embedding array reconstructed from clean-speech features. Finally, CNNs are used as the decoder of the array, and the noise features are removed by projection to obtain an enhanced speech amplitude spectrum.

STOI and PESQ measurements show that DERE-Att outperforms EED and LSTM under both low and high signal-to-noise ratios, and principal component analysis diagrams visualize the process: during the projection performed by the CNNs, noise-dominant points gradually converge and their energy is set to zero, while the energy of speech-dominant points is retained. Based on earlier failed model designs, this thesis speculates that a neural network may follow a training route and tends to choose the easier one, which can leave the two networks insufficiently trained. In addition, the EED decoder is improved by adding a dense connection structure, which gives better performance under the extreme condition of a -15 dB signal-to-noise ratio.

In summary, this thesis proposes a model that addresses the speech distortion caused by speech enhancement as well as speech enhancement at extremely low signal-to-noise ratios, and it achieves a clear performance improvement over the original EED and LSTM.
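The architecture described above can be summarized with a small, runnable sketch. The PyTorch code below is only an illustration of the idea, not the thesis's actual implementation: the module names, layer sizes, the way clean-speech memory slots are formed, and the use of standard multi-head attention are all assumptions made for the example.

# Minimal sketch (assumed design, not the thesis's code) of the DERE-Att idea:
# a noisy magnitude spectrogram is lifted to a per-bin embedding array, clean-speech
# features are distilled into a small memory, attention reconstructs a clean embedding
# array, and a CNN decoder projects it back to an enhanced magnitude spectrum.
import torch
import torch.nn as nn


class EmbeddingEncoder(nn.Module):
    """Maps each time-frequency bin of a magnitude spectrogram to a D-dim embedding."""
    def __init__(self, emb_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, emb_dim, kernel_size=3, padding=1),  # uses T-F neighborhood info
            nn.ReLU(),
            nn.Conv2d(emb_dim, emb_dim, kernel_size=3, padding=1),
        )

    def forward(self, mag):                      # mag: (B, F, T)
        emb = self.net(mag.unsqueeze(1))         # (B, D, F, T)
        return emb.flatten(2).transpose(1, 2)    # (B, F*T, D) embedding array


class DereAttSketch(nn.Module):
    """Attention combines the noisy embedding array with memory from clean speech."""
    def __init__(self, emb_dim: int = 32, n_memory: int = 64):
        super().__init__()
        self.encoder = EmbeddingEncoder(emb_dim)
        self.memory_proj = nn.Linear(emb_dim, emb_dim)   # hypothetical memory extractor
        self.attn = nn.MultiheadAttention(emb_dim, num_heads=4, batch_first=True)
        self.decoder = nn.Sequential(                    # CNN decoder projecting back to magnitude
            nn.Conv2d(emb_dim, emb_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(emb_dim, 1, kernel_size=1),
            nn.ReLU(),                                   # magnitudes are non-negative
        )
        self.n_memory = n_memory

    def forward(self, noisy_mag, clean_mag):
        B, F, T = noisy_mag.shape
        noisy_emb = self.encoder(noisy_mag)              # (B, F*T, D) query
        clean_emb = self.encoder(clean_mag)              # (B, F*T, D)
        # Distil clean-speech features into a fixed number of memory slots (assumed design).
        memory = self.memory_proj(clean_emb[:, : self.n_memory, :])
        recon, _ = self.attn(noisy_emb, memory, memory)  # reconstructed embedding array
        recon = recon.transpose(1, 2).reshape(B, -1, F, T)
        return self.decoder(recon).squeeze(1)            # enhanced magnitude spectrum (B, F, T)


if __name__ == "__main__":
    model = DereAttSketch()
    noisy = torch.rand(2, 129, 100)   # (batch, frequency bins, frames)
    clean = torch.rand(2, 129, 100)
    print(model(noisy, clean).shape)  # torch.Size([2, 129, 100])

In this sketch the decoder maps the reconstructed embedding array directly to a magnitude spectrum, matching the mapping-based description in the abstract; a masking-based decoder would be an equally plausible variant.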
Keywords/Search Tags: Speech Enhancement, Neural Network, Deep Learning, Representation Learning