
Research on Speech Enhancement Algorithm Based on Neural Network in Complex Environment

Posted on: 2022-05-09
Degree: Master
Type: Thesis
Country: China
Candidate: Q Shang
GTID: 2518306560492794
Subject: Computer technology
Abstract/Summary:
Speech signals are inevitably corrupted by noise, reverberation and other interference during transmission. By the time the signal reaches the receiver, its perceived quality and intelligibility are seriously degraded, which hinders efficient interactive communication. Speech enhancement technology emerged to address this problem. It is a special case of sound source separation that aims to clean and recover speech damaged by environmental interference, and it is widely used in smart home devices, instant messaging and remote conferencing.

This thesis studies speech enhancement algorithms based on neural networks. An in-depth analysis of recent work in the field shows that most research focuses on speech enhancement under additive noise, while comparatively few studies address complex environments such as very low signal-to-noise ratios and multiplicative reverberation. In addition, the loss functions used to train the networks do not take the auditory perception characteristics of the human ear into account and are not tailored to the speech enhancement task. This limits the learning ability and interference suppression of the model and makes clean speech harder to recover. As a recent model structure, the convolutional time-domain audio separation network (ConvTasNet) has achieved good results in speech source separation. Building on ConvTasNet, this thesis studies speech enhancement and interference suppression in complex environments. The main research contents are as follows:

1. A convolutional time-domain speech separation network based on squeeze-and-expand attention (Squeeze-Expand ConvTasNet, SEConvTasNet) is proposed. It modifies the residual block of the original ConvTasNet: the skip connection structure is removed, a Squeeze-Expand attention mechanism models the channels explicitly, and the dual connection mechanism is replaced by a single residual connection. This reduces the number of model parameters and also improves the noise removal ability of the model. In addition, gated convolution and the PReLU activation function are used to form the mask, so that weights are generated independently for every point in the output feature space, which solves the numerical saturation problem of the ConvTasNet output module. Finally, a scale-invariant mean square error loss function is used. It ensures that the loss is unaffected by amplitude scaling and reflects the difference between the estimated speech and the clean speech more accurately. Experimental results show that, compared with the original ConvTasNet, SEConvTasNet improves the perceptual evaluation of speech quality (PESQ) by 5.66% and the scale-invariant signal-to-noise ratio by 5.34%.

2. For complex backgrounds in which noise and reverberation are present at the same time, this thesis proposes the SEConvTasNet-T model, which adds a Transformer module driven by the self-attention mechanism to SEConvTasNet. The module improves the speech reconstruction accuracy of the model and addresses the long-term dependence modeling problem of the input speech sequence in complex environments, especially in reverberation suppression. Experimental results show that, compared with SEConvTasNet, SEConvTasNet-T improves PESQ and short-time objective intelligibility (STOI) by 5.19% and 1.49% respectively.

In addition, the experiments show that although a purely time-domain model combined with a time-domain optimization objective can improve objective speech quality on most time-domain metrics, it often neglects the high-frequency components of speech, which lowers its objective intelligibility. Therefore, this thesis combines the scale-invariant mean square error loss function with a Mel-spectrum loss, which is more in line with human auditory characteristics, as the time-frequency mixed loss of SEConvTasNet. Experimental results show that the model improves speech quality and short-time objective intelligibility by 4.1% and 0.39% respectively.
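The abstract does not give the formula for its scale-invariant loss or for the scale-invariant signal-to-noise ratio it reports. As a minimal sketch, the commonly used SI-SNR definition (project the zero-mean estimate onto the zero-mean reference, then compare the energy of that projection with the energy of the residual) can be written as follows. The function name `si_snr`, the `eps` stabilizer and the synthetic test signal are assumptions made for illustration, not taken from the thesis.

```python
import numpy as np

def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB (common definition; assumed, not from the thesis)."""
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    # Remove the mean so that a DC offset does not affect the result.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference: the "target" component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    # Everything not explained by the reference is treated as residual noise.
    noise = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(noise, noise) + eps)
    return 10.0 * np.log10(ratio)

# Hypothetical example: a 440 Hz tone (1 s at 16 kHz) with additive Gaussian noise.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noisy = clean + 0.1 * rng.standard_normal(16000)

score = si_snr(noisy, clean)
score_rescaled = si_snr(3.0 * noisy, clean)  # same estimate, different amplitude
```

Rescaling the estimate leaves the score essentially unchanged, which is the property the abstract attributes to its loss. A training loss based on this quantity would typically minimize the negative SI-SNR.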
Keywords/Search Tags: Speech Enhancement, Dereverberation, ConvTasNet Neural Network, Time-Frequency Mixed Loss Function