
Study On Speech Enhancement Method Based On Deep Learning

Posted on: 2024-08-20
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X L Shen
Full Text: PDF
GTID: 1528307118981409
Subject: Information and Communication Engineering
Abstract/Summary:
Speech is the most natural and effective medium of information exchange between people, and also the most convenient and efficient mode of human-computer interaction. In specific scenarios such as meetings and vehicle cabins, the appearance of unknown noise, the intensity of the noise, the number of microphone channels, the stationarity of the noise, and speaker variability can all degrade the performance of intelligent speech systems. Starting from single-channel systems and building on deep learning, this dissertation designs a feature transformation and fusion method for speech to improve enhancement performance, and adds frequency-domain constraints to reduce harmonic loss and thereby improve speech recognition. The research then extends to multi-channel microphone scenarios, integrating attention mechanisms to improve the spatial pickup of filter structures and verifying the noise reduction performance of centralized and distributed array structures. The main contributions are as follows:

(1) Within a deep learning framework with an encoder-decoder structure, a time-domain speech enhancement method based on a fully convolutional network, free of assumptions about the signal, is proposed. Speech signals exhibit correlation in the time domain, and the local receptive field of convolution is used to capture local feature details of the signal. To widen the limited perception of the convolution operation, a two-dimensional feature transformation and fusion module based on a multi-head attention mechanism is designed to model the speech signal globally (a minimal sketch of this module follows below). Combining the time and space dimensions, the method extracts local features and global context of the speech signal, optimizes how features propagate through the network, and guides the model to learn the nonlinear mapping between noisy and clean speech. This method shows clear performance advantages at low SNR and under non-stationary noise, and effectively reduces residual noise.

(2) A frequency-domain constraint is added at the front end of the time-domain enhancement model, and joint frequency-time modeling is adopted to avoid losing key spectral content during enhancement, which improves the intelligibility and recognition rate of the enhanced speech. The model divides processing into a frequency-domain stage and a time-domain stage. In the frequency-domain stage, a two-dimensional multi-head attention mechanism extracts features automatically, establishes the frequency-domain mapping between noisy and clean speech, and retains the harmonic features useful for speech recognition tasks. In the time-domain stage, the waveform of the noisy speech after frequency-domain enhancement is used as the training data and the clean waveform as the training target, so that the time-domain and frequency-domain characteristics of noisy speech are deeply fused in the network. The network is trained by minimizing the harmonic loss in the frequency domain, achieving enhancement while preserving the high-frequency content of speech to the greatest extent and improving the quality of the enhanced signal.
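As an illustration of contribution (1), the following is a minimal PyTorch sketch of a fully convolutional time-domain encoder-decoder with a multi-head attention fusion block between encoder and decoder for global context. All layer sizes and the names FusionBlock and TimeDomainEnhancer are illustrative assumptions, not the dissertation's exact architecture.

```python
# Minimal sketch of contribution (1): strided 1-D convolutions capture
# local waveform detail; a multi-head self-attention block between the
# encoder and decoder adds global time context. Sizes are illustrative.
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Global modeling over time frames via multi-head self-attention."""
    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):               # x: (batch, channels, frames)
        h = x.transpose(1, 2)           # -> (batch, frames, channels)
        a, _ = self.attn(h, h, h)       # global context across frames
        h = self.norm(h + a)            # residual keeps local conv detail
        return h.transpose(1, 2)

class TimeDomainEnhancer(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        # strided 1-D convolutions act directly on the raw waveform
        self.encoder = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=16, stride=8, padding=4),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=8, stride=4, padding=2),
            nn.ReLU(),
        )
        self.fusion = FusionBlock(channels)
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(channels, channels, kernel_size=8, stride=4,
                               padding=2),
            nn.ReLU(),
            nn.ConvTranspose1d(channels, 1, kernel_size=16, stride=8,
                               padding=4),
        )

    def forward(self, wav):             # wav: (batch, 1, samples)
        feats = self.encoder(wav)       # local features from convolution
        feats = self.fusion(feats)      # global time-context fusion
        return self.decoder(feats)      # back to an enhanced waveform

noisy = torch.randn(2, 1, 16384)        # two 1-second clips at 16 kHz
clean_est = TimeDomainEnhancer()(noisy)
print(clean_est.shape)                  # torch.Size([2, 1, 16384])
```

The residual connection around the attention block is a common design choice here: the network can add global context without discarding the local detail the convolutions already extracted.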
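For contribution (2), the following is a hedged sketch of a joint time-frequency training objective: a waveform loss plus a frequency-domain term that penalizes spectral (and hence harmonic) mismatch between the estimate and the clean target. The STFT settings and the weight `alpha` are assumptions; the dissertation's exact harmonic loss is not reproduced here.

```python
# Sketch of a joint time-frequency loss in the spirit of contribution (2):
# the frequency-domain term constrains the spectrum of the estimate so
# that key spectral (harmonic) content is not lost during enhancement.
import torch

def spectral_loss(est, ref, n_fft=512, hop=128):
    """L1 distance between magnitude spectrograms of estimate and target."""
    win = torch.hann_window(n_fft, device=est.device)
    E = torch.stft(est, n_fft, hop, window=win, return_complex=True).abs()
    R = torch.stft(ref, n_fft, hop, window=win, return_complex=True).abs()
    return (E - R).abs().mean()

def joint_loss(est_wav, clean_wav, alpha=0.5):
    time_term = (est_wav - clean_wav).abs().mean()   # waveform fidelity
    freq_term = spectral_loss(est_wav, clean_wav)    # spectral fidelity
    return time_term + alpha * freq_term

est = torch.randn(2, 16000)      # model output, shape (batch, samples)
clean = torch.randn(2, 16000)    # clean training target
print(joint_loss(est, clean))
```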
(3) For the centralized multi-channel speech enhancement scenario, a multi-channel beamforming filter model based on a frame-level attention mechanism is proposed (sketched below). First, the multi-channel microphone array picks up the speech signal, and inter-channel features are extracted at the frame level through attention. The spatial characteristics of the array structure are then used for channel information interaction, giving the noise reduction a higher degree of freedom. Next, per-channel scores are computed and different weights are assigned to each channel to obtain more accurate array structure features. Finally, a two-stage bidirectional LSTM (TSBiLSTM) built on the BiLSTM structure captures the spatio-temporal relationships of the source speech. Simulated experiments verify that this method outperforms classical beamforming algorithms and that it generalizes across SNRs, speakers, and noise types.

(4) To address the imbalance of signal energy across channels in distributed arrays, a multi-channel speech enhancement method is proposed that uses a single-channel enhancement model to extract reference-signal features and jointly average them. Because the channels of a distributed microphone array capture signals of unbalanced energy, a Pre-Net module based on reference channel selection is first designed to balance the signal energy. Then NCC and MCS features are extracted and combined with averaging to reduce the signal differences between channels. Finally, ACG-Net is designed with a multi-head attention mechanism to generate a noise reduction module for each microphone channel, making full use of spatial features to improve the enhancement effect. Experiments on centralized two-channel, centralized four-channel, and distributed four-channel structures verify that the energy-balanced distributed four-channel configuration performs best and generalizes well across SNRs, effectively addressing the problem of multi-channel speech enhancement in vehicles.
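The sketch below illustrates the frame-level channel attention of contribution (3): each microphone channel is scored per frame, the channels are combined as an attention-weighted sum, and a bidirectional LSTM models the fused sequence. The names ChannelAttention and AttentiveBeamformer are illustrative; the dissertation's TSBiLSTM details are not reproduced.

```python
# Sketch of contribution (3): attention assigns per-frame weights over
# microphone channels, then a BiLSTM models the fused feature sequence.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, feat_dim):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)   # one scalar score per channel

    def forward(self, x):                     # x: (batch, chans, frames, feat)
        w = torch.softmax(self.score(x), dim=1)   # weights over channels
        return (w * x).sum(dim=1)             # fused: (batch, frames, feat)

class AttentiveBeamformer(nn.Module):
    def __init__(self, feat_dim=257, hidden=128):
        super().__init__()
        self.chan_attn = ChannelAttention(feat_dim)
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.mask = nn.Linear(2 * hidden, feat_dim)

    def forward(self, x):                     # multi-channel spectral features
        fused = self.chan_attn(x)             # channel-weighted fusion
        h, _ = self.blstm(fused)              # temporal modeling
        return torch.sigmoid(self.mask(h))    # per-frame enhancement mask

x = torch.randn(2, 4, 100, 257)               # 4 channels, 100 frames
mask = AttentiveBeamformer()(x)
print(mask.shape)                              # torch.Size([2, 100, 257])
```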
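Finally, a hedged sketch of the distributed-array front end of contribution (4): pick the highest-energy channel as the reference, rescale the other channels to balance energy, and compute a normalized cross-correlation (NCC) feature against the reference. This mirrors the described Pre-Net and NCC ideas only at a high level; the MCS feature and ACG-Net are omitted, and the reference-selection rule shown is an assumption.

```python
# Sketch of contribution (4)'s energy balancing and NCC feature for a
# distributed microphone array with unbalanced per-channel energy.
import torch

def energy_balance(x):                    # x: (chans, samples)
    energy = x.pow(2).mean(dim=1)         # per-channel signal energy
    ref = int(energy.argmax())            # reference = strongest channel
    gain = (energy[ref] / energy.clamp_min(1e-8)).sqrt()
    return x * gain.unsqueeze(1), ref     # equal-energy channels, ref index

def ncc_feature(x, ref):
    """Per-channel normalized cross-correlation with the reference channel."""
    r = x[ref]
    num = (x * r).sum(dim=1)
    den = x.norm(dim=1) * r.norm() + 1e-8
    return num / den                      # in [-1, 1], one value per channel

# four channels with deliberately unequal gains, as in a distributed array
mics = torch.randn(4, 16000) * torch.tensor([1.0, 0.3, 0.6, 0.1]).unsqueeze(1)
balanced, ref = energy_balance(mics)
print(ref, ncc_feature(balanced, ref))
```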
Keywords/Search Tags: Speech enhancement, Fully convolutional network, Feature transformation, Multi-channel, Time-frequency fusion