
Research On Beamforming Technology Of Deep Learning Far-field Speech Recognition

Posted on: 2022-09-11
Degree: Master
Type: Thesis
Country: China
Candidate: X B Guo
Full Text: PDF
GTID: 2518306521957909
Subject: Information and Communication Engineering
Abstract/Summary:
With the development of deep learning, speech recognition technology has advanced rapidly and entered a new stage of development. In near-field acoustic environments, the recognition accuracy of Automatic Speech Recognition (ASR) exceeds that of humans. In practical applications, however, the captured speech signal always contains noise, reverberation, echo, competing speakers and other interference, which seriously degrades recognition accuracy. In contrast to the near-field case, the far-field acoustic environment is one in which the sound source is 1 m to 10 m from the receiver. It covers most practical application scenarios of ASR systems, such as smart speakers, wearable devices and hearing aids. Far-field ASR technology improves speech recognition performance in complex acoustic environments, so it is a key enabler for bringing speech recognition into people's daily lives. It is also a difficult and active research topic in the field, and international challenges such as CHiME and REVERB have been held to advance it. By analyzing the far-field ASR system of the CHiME-4 challenge, this dissertation studies in depth the beamforming algorithm used for front-end speech enhancement. The main research contents are as follows.

First, an integrated mask estimation method for beamforming that combines a neural network (NN) based estimator with a spatial clustering (SC) based estimator is proposed. It addresses two problems: the data mismatch caused by supervised training of the NN-based mask estimator, and the incomplete use of signal information when the source presence probability is estimated from a real-valued mask. The method improves the accuracy of the estimated source presence probability in two ways. On the one hand, the mask estimated by the neural network is converted into a source presence probability and used as the initial mask of the SC-based mask estimator, so that the unsupervised estimation of the SC-based method relieves the data mismatch problem of the NN-based mask estimator. On the other hand, a complex-valued time-frequency mask is introduced into the integrated method, and the accuracy of the estimated source presence probability is improved by making full use of the signal's amplitude and phase information. Experimental results show that the integrated method effectively alleviates the data mismatch problem of NN-based mask estimation, and that introducing the complex-valued time-frequency mask improves the accuracy of the estimated source presence probability. The proposed method achieves an 8.37% relative reduction in average word error rate compared with the baseline system.

Second, because a real-valued neural network estimates a complex-valued time-frequency mask inaccurately, an integrated mask estimation method for beamforming that combines a complex-valued neural network (CVNN) with the SC-based estimator is proposed. On the one hand, a complex-valued fully connected network is used as the backbone of the mask estimator. The correlation between the real and imaginary parts of complex numbers is exploited to reduce the degrees of freedom of the neural network and to improve the accuracy of complex-valued time-frequency mask estimation. On the other hand, a complex-valued LSTM network is built from the real-valued LSTM network and used as the backbone of the mask estimator. Contextual information is introduced through the robust memory mechanism of complex-valued representations and the memory ability of the LSTM, which improves the accuracy of time-frequency mask estimation and thereby the performance of far-field speech recognition. Experiments show that, within the integrated framework, the complex-valued fully connected network achieves a 2.73% relative reduction in average word error rate compared with the real-valued fully connected network. A significance test confirms that this improvement is not due to chance. Experiments also show that the complex-valued LSTM network achieves the expected improvement in mask estimation within the integrated framework. However, the resulting beamforming performance is not satisfactory. A possible cause is numerical overflow while the complex-valued LSTM network estimates the time-frequency mask, which affects the computation of the source presence probability and of the beamforming filter coefficients.

Third, because the mask estimator that integrates the NN-based and SC-based methods still does not eliminate the data mismatch problem caused by supervised learning, an unsupervised mask estimation method based on Neural Expectation Maximization (Neural EM) is proposed. On the one hand, the Neural EM mask estimator unrolls the iterative steps of the EM algorithm into a sequence of layers in a deep network and replaces the M step of the EM algorithm with a backbone neural network that updates the parameters of the probabilistic model. It thus combines the NN-based method with the model-based method more closely and realizes unsupervised, neural-network-based mask estimation. On the other hand, an unsupervised time-frequency mask estimator based on RNN-EM is realized by replacing the iterative steps of the EM algorithm with the internal recurrent structure of an RNN and by using an encoder to extract more robust features for updating the parameters of the probabilistic model. This improves the robustness and accuracy of unsupervised time-frequency mask estimation. Experiments show that the unsupervised time-frequency mask estimator based on Neural EM is feasible, and that the estimator based on RNN-EM outperforms the one based on Neural EM.
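
To make the role of the time-frequency mask concrete, the following is a minimal NumPy sketch of how a mask can drive a beamformer. It is an illustration only, not the dissertation's implementation: the abstract does not say which beamformer is used, so an MVDR filter with a reference microphone is assumed here, and the function names (`spatial_covariance`, `mvdr_weights`, `beamform`), the array layout `(F, T, M)` for frequency bins, frames and microphones, and the diagonal-loading constant are all hypothetical choices.

```python
import numpy as np

def spatial_covariance(Y, mask):
    """Mask-weighted spatial covariance per frequency bin.

    Y:    complex STFT of the array signal, shape (F, T, M)  (assumed layout)
    mask: real-valued source presence probability, shape (F, T)
    Returns an array of shape (F, M, M).
    """
    weighted = Y * mask[..., None]
    # Sum over frames of mask(f, t) * y(f, t) y(f, t)^H
    phi = np.einsum("ftm,ftn->fmn", weighted, Y.conj())
    norm = np.maximum(mask.sum(axis=1), 1e-10)
    return phi / norm[:, None, None]

def mvdr_weights(phi_s, phi_n, ref=0, eps=1e-10):
    """Assumed MVDR variant: w = (Phi_n^-1 Phi_s) u_ref / trace(Phi_n^-1 Phi_s)."""
    M = phi_n.shape[-1]
    phi_n = phi_n + eps * np.eye(M)          # small diagonal loading for stability
    num = np.linalg.solve(phi_n, phi_s)      # Phi_n^-1 Phi_s, shape (F, M, M)
    trace = np.trace(num, axis1=-2, axis2=-1)
    return num[..., ref] / (trace[:, None] + eps)

def beamform(Y, w):
    """Apply the filter per bin: y_hat(f, t) = w(f)^H y(f, t)."""
    return np.einsum("fm,ftm->ft", w.conj(), Y)
```

In this sketch the speech mask and the noise mask (for example one minus the speech mask) would each be passed to `spatial_covariance`, and the two resulting covariance matrices to `mvdr_weights`. A more accurate mask gives cleaner covariance estimates, which is why the mask estimators described above matter for the beamformer output.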
Keywords/Search Tags:Far-field speech recognition, beamforming, time-frequency mask estimation, integrated method, complex-valued neural network, Neural EM