
Research On Single Channel Speech Enhancement Based On Multi-head Attention Mechanism

Posted on: 2022-10-16
Degree: Master
Type: Thesis
Country: China
Candidate: W W Yu
Full Text: PDF
GTID: 2518306542462854
Subject: Software engineering
Abstract/Summary:
Speech is the most important means of communication between people, and with technological development it has also become an important means of interaction between people and computers. In many environments, however, the speech signal is interfered with by other signals, which reduces the efficiency and effectiveness of communication. Effectively improving voice quality in noisy environments is therefore of great significance. Speech enhancement is the primary technology for improving the quality and intelligibility of the target speech signal under noisy conditions.

In recent years, with the development of deep learning, single-channel speech enhancement algorithms have made considerable progress. The recurrent neural network has become a common model for speech enhancement because it naturally models the sequential structure of speech. However, recurrent neural networks have two problems. First, vanishing and exploding gradients over long-term dependencies seriously degrade their performance. Second, the output of the previous step is fed back as the input of the current step, so the sequential computation is difficult to parallelize; this limits real-time processing, an important requirement for speech enhancement applications. As an alternative to recurrent neural networks, networks based on the multi-head attention mechanism are also unable to model speech signals well because of the limitations of the position embedding module, and hence cannot realize their full potential. All of these issues limit the further development of speech enhancement.

This thesis studies speech enhancement models based on the multi-head attention mechanism. The main research work is as follows.

First, to exploit the advantages of the multi-head attention mechanism and make better use of the positional information of the speech signal, a new speech enhancement model is proposed, based on the standard Transformer structure. Specifically, long short-term memory (LSTM), a variant of the recurrent neural network, is used in place of the position embedding module to encode the positional information of the input speech signal. At the same time, to avoid the recurrent network's vanishing or exploding gradients and its inability to parallelize, a new computation scheme, local long short-term memory (Local LSTM), is adopted; a sketch follows this paragraph. In theory, the new model can effectively exploit the positional information of the speech signal while remaining convenient to parallelize at inference time. Experimental results show that, compared with the baseline model, the new model consistently achieves better speech quality and intelligibility under unseen noise conditions, and its speed increases substantially.
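The abstract does not define how Local LSTM is computed. The sketch below, in PyTorch, shows one plausible reading: the frame sequence is cut into fixed-length chunks and the LSTM runs over each chunk independently, so all chunks are processed in parallel and no gradient flows across chunk boundaries. The class name, chunk size, and additive wiring in front of the attention layers are illustrative assumptions, not the thesis's published design.

```python
import torch
import torch.nn as nn


class LocalLSTMPositionalEncoder(nn.Module):
    # One plausible "Local LSTM" (assumption, not the thesis's design):
    # split the frame sequence into fixed-length chunks and run an LSTM
    # over each chunk independently. Folding chunks into the batch
    # dimension lets all chunks run in parallel, and gradients never
    # cross chunk boundaries, avoiding long-range vanishing/exploding
    # gradients.
    def __init__(self, feat_dim: int, hidden_dim: int, chunk_size: int = 32):
        super().__init__()
        self.chunk_size = chunk_size
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, feat_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, feat_dim) spectrogram features
        b, t, f = x.shape
        pad = (-t) % self.chunk_size  # pad so frames divide into chunks
        if pad:
            x = torch.nn.functional.pad(x, (0, 0, 0, pad))
        n_chunks = x.shape[1] // self.chunk_size
        chunks = x.reshape(b * n_chunks, self.chunk_size, f)
        out, _ = self.lstm(chunks)          # all chunks run in parallel
        pos = self.proj(out).reshape(b, n_chunks * self.chunk_size, f)
        # Add the LSTM-derived positional code to the input frames, as a
        # learned position embedding would be, before the attention layers.
        return x[:, :t] + pos[:, :t]


# Usage: stands in for the position-embedding step in front of a standard
# Transformer encoder (e.g. torch.nn.TransformerEncoder).
```

In this reading, locality trades some long-range positional context for parallelism and bounded gradient paths, which matches the two motivations stated above.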
Second, the above model is trained by minimizing the mean square error of the speech magnitude spectrum, an objective with no direct connection to the evaluation metrics of speech enhancement, which in theory limits the model's performance. Building on the above model, optimization of the scale-invariant signal-to-distortion ratio (SI-SDR) is therefore considered; a sketch of the loss is given below. However, SI-SDR is computed from the time-domain waveform, while the features output by the above model lie in the time-frequency domain, so backpropagation training cannot be enabled by simply changing the optimization function. The model structure is therefore modified: a Fourier transform, realized as a one-dimensional convolutional layer, is integrated into the model (see the second sketch below), so that the new model directly takes the speech waveform as input and output, which simplifies the training process. At the same time, experimental results show that the new model structure achieves good performance.
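SI-SDR itself is a standard metric. A minimal PyTorch sketch of a negative-SI-SDR training loss on time-domain waveforms might look as follows; the eps stabilizer and the batch-mean reduction are implementation choices of this sketch, not details taken from the thesis.

```python
import torch


def si_sdr_loss(estimate: torch.Tensor, target: torch.Tensor,
                eps: float = 1e-8) -> torch.Tensor:
    # Negative scale-invariant signal-to-distortion ratio, computed on
    # time-domain waveforms of shape (batch, samples). Minimizing it
    # drives the output toward the clean target up to an arbitrary gain,
    # matching the evaluation metric directly.
    # Remove the per-utterance mean so the measure is truly scale-invariant.
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # Project the estimate onto the target: s_t = <e, s> s / ||s||^2
    dot = torch.sum(estimate * target, dim=-1, keepdim=True)
    s_target = dot * target / (torch.sum(target ** 2, dim=-1, keepdim=True) + eps)
    e_noise = estimate - s_target
    ratio = torch.sum(s_target ** 2, dim=-1) / (torch.sum(e_noise ** 2, dim=-1) + eps)
    si_sdr = 10.0 * torch.log10(ratio + eps)
    return -si_sdr.mean()   # average over the batch; negate to minimize
```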
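A Fourier transform can be realized as a fixed one-dimensional convolution whose kernels are the windowed DFT basis, which appears to be the mechanism the abstract describes; this keeps the transform differentiable, so the SI-SDR loss can backpropagate through it. The sketch below is minimal, and the FFT size, hop length, and Hann window are assumed values.

```python
import math
import torch
import torch.nn as nn


class ConvSTFT(nn.Module):
    # Fourier transform as a fixed 1-D convolution: each output channel
    # correlates the waveform with one windowed DFT basis function, so the
    # model can take raw waveforms in and out while remaining differentiable.
    def __init__(self, n_fft: int = 512, hop: int = 128):
        super().__init__()
        window = torch.hann_window(n_fft)
        n = torch.arange(n_fft).float()
        k = torch.arange(n_fft // 2 + 1).float().unsqueeze(1)  # (bins, 1)
        # Real and imaginary DFT basis, each windowed: (bins, n_fft)
        real = torch.cos(2 * math.pi * k * n / n_fft) * window
        imag = -torch.sin(2 * math.pi * k * n / n_fft) * window
        kernel = torch.cat([real, imag], dim=0).unsqueeze(1)   # (2*bins, 1, n_fft)
        self.register_buffer("kernel", kernel)
        self.n_bins = n_fft // 2 + 1
        self.hop = hop

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples); conv stride = hop length between frames
        spec = torch.nn.functional.conv1d(wav.unsqueeze(1), self.kernel,
                                          stride=self.hop)
        real, imag = spec[:, :self.n_bins], spec[:, self.n_bins:]
        # Magnitude spectrogram: (batch, bins, frames)
        return torch.sqrt(real ** 2 + imag ** 2 + 1e-8)
```

A matching transposed convolution built from the same basis then maps enhanced time-frequency features back to a waveform, giving the waveform-in, waveform-out structure described above.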
Keywords/Search Tags:Speech Enhancement, Deep Learning, Multi-head self-attention, Recurrent Neural Networks