
Research on Deep Learning Based Far-Field Speech Recognition

Posted on: 2020-06-17
Degree: Master
Type: Thesis
Country: China
Candidate: C R Liu
Full Text: PDF
GTID: 2428330620953230
Subject: Information and Communication Engineering
Abstract/Summary:
Speech is an important means of communication for human beings: with speech, people convey information and express emotion. With the development of computer technology, scientists began to study how to make computers "understand" human language, and speech recognition came into being. Speech recognition has long been regarded as a link and bridge between people and machines. It can help people streamline work processes, improve efficiency, and make human-computer interaction more convenient. It is therefore widely used in many fields and has great prospects in many areas that are still undeveloped.

Far-field speech recognition is an important branch of speech recognition, with applications in smart homes, meeting transcription, car navigation, and other scenarios. A far-field speech recognition system usually uses a microphone array to collect speech. Because the real environment contains large amounts of background noise, multipath reflection, reverberation, and even interfering speakers, the quality of the speech signal is degraded, and far-field recognition accuracy is generally significantly lower than that of near-field recognition. To address these problems, this thesis makes three contributions in network structure design and adaptation methods.

First, aiming at the iterative complexity, low efficiency, and noise-free assumption of weighted prediction error (WPE) dereverberation, this thesis proposes a speech enhancement method that uses a long short-term memory (LSTM) network for WPE estimation. The LSTM is trained to suppress noise so that the noise-free assumption is satisfied, and the ideal ratio mask (IRM) of the speech is obtained. The variances of the noise-free reverberant spectrum and the desired speech spectrum are then estimated from the IRM, and finally the dereverberated speech is computed with the WPE formula. Experiments show that using the LSTM in the WPE computation reduces computation time and significantly improves speech quality.

Second, aiming at the problem that beamforming has only a weak ability to suppress certain kinds of noise and leaves residual noise, this thesis proposes a speech enhancement method based on Wiener post-filtering with an optimized DNN. The DNN is trained on far-field speech magnitude spectra to produce estimates of the clean speech magnitude spectrum and the noise magnitude spectrum. The enhanced speech magnitude spectrum is then obtained by applying the Wiener gain function computed from these two estimates. Close-talk speech is used as the supervision target, and the network weights are updated by back-propagation. In addition, considering the correlation between adjacent speech frames, the objective function of the DNN is optimized. Experiments show that the DNN-based Wiener post-filter further suppresses noise and improves speech quality, and that the optimized objective function lets the DNN model far-field speech better, yielding a significant improvement in speech recognition accuracy.

Third, aiming at the problem that the language model cannot adequately model high-frequency vocabulary in far-field conversational speech, this thesis proposes an improved recurrent neural network language model (RNNLM). The method applies Fast Marginal Adaptation (FMA): the RNNLM probability from the baseline system is multiplied by a word-specific factor and then renormalized. These factors are estimated from DNNs trained on the transcription text. Experiments show that the adapted RNNLM achieves lower perplexity and higher recognition accuracy during decoding.
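As a rough illustration of the first contribution, the sketch below applies an already-computed IRM to the observed spectrogram to estimate the desired-speech power, then solves the standard single-channel WPE normal equations once instead of iterating. The function name, the tap and delay settings, and the assumption that the mask comes from a pre-trained LSTM are all placeholders, not the thesis's exact implementation.

```python
import numpy as np

def mask_based_wpe(stft, mask, taps=10, delay=3, eps=1e-10):
    """Single-channel WPE dereverberation for one utterance.

    stft : complex array (F, T) -- observed spectrogram
    mask : real array    (F, T) -- IRM predicted by an LSTM (assumed given)
    Returns the dereverberated spectrogram (F, T).
    """
    F, T = stft.shape
    out = stft.copy()
    # The mask-enhanced speech power replaces the iterative variance
    # estimate of classic WPE.
    power = np.maximum(np.abs(mask * stft) ** 2, eps)

    for f in range(F):
        y = stft[f]                  # (T,)
        lam = power[f]               # (T,)
        # Stack delayed observations Y_tilde (taps, T)
        Y_tilde = np.zeros((taps, T), dtype=complex)
        for k in range(taps):
            shift = delay + k
            Y_tilde[k, shift:] = y[:T - shift]
        # Weighted correlation statistics and filter estimate
        w = 1.0 / lam
        R = (Y_tilde * w) @ Y_tilde.conj().T      # (taps, taps)
        r = (Y_tilde * w) @ y.conj()              # (taps,)
        g = np.linalg.solve(R + eps * np.eye(taps), r)
        # Subtract the predicted late reverberation
        out[f] = y - g.conj() @ Y_tilde
    return out
```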
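For the second contribution, a minimal PyTorch sketch of a two-head DNN that predicts clean-speech and noise magnitude spectra and forms a Wiener gain from them is shown below. The layer sizes, the two-head layout, and the adjacent-frame term in the loss are illustrative assumptions standing in for the thesis's optimized objective, not its exact configuration.

```python
import torch
import torch.nn as nn

class WienerPostFilterDNN(nn.Module):
    """Two-head DNN: predicts clean-speech and noise magnitude spectra."""
    def __init__(self, n_bins=257, hidden=1024):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(n_bins, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.speech_head = nn.Linear(hidden, n_bins)
        self.noise_head = nn.Linear(hidden, n_bins)

    def forward(self, far_mag):                    # (B, T, n_bins)
        h = self.trunk(far_mag)
        s_hat = torch.relu(self.speech_head(h))    # clean magnitude estimate
        n_hat = torch.relu(self.noise_head(h))     # noise magnitude estimate
        gain = s_hat ** 2 / (s_hat ** 2 + n_hat ** 2 + 1e-8)  # Wiener gain
        return gain * far_mag                      # enhanced magnitude


def context_loss(enhanced, close_talk, weight=0.5):
    """Frame-wise MSE plus a weighted adjacent-frame difference term,
    a simple stand-in for the thesis's context-aware objective."""
    frame_term = torch.mean((enhanced - close_talk) ** 2)
    delta_term = torch.mean(
        ((enhanced[:, 1:] - enhanced[:, :-1])
         - (close_talk[:, 1:] - close_talk[:, :-1])) ** 2)
    return frame_term + weight * delta_term
```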
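For the language-model contribution, Fast Marginal Adaptation rescales each word's baseline RNNLM probability by a word-specific factor (commonly a ratio of in-domain to background unigram probabilities raised to a power) and renormalizes over the vocabulary. The sketch below assumes those unigram estimates are already available (in the thesis they come from DNNs trained on transcription text) and shows only the rescale-and-renormalize step.

```python
import numpy as np

def fast_marginal_adaptation(rnnlm_probs, adapt_unigram, background_unigram, beta=0.5):
    """Adapt a baseline RNNLM next-word distribution with FMA.

    rnnlm_probs        : (V,) baseline P(w | history) from the RNNLM
    adapt_unigram      : (V,) in-domain unigram estimate (assumed given)
    background_unigram : (V,) background unigram estimate
    beta               : exponent controlling adaptation strength
    """
    # Word-specific scaling factors
    factors = (adapt_unigram / np.maximum(background_unigram, 1e-12)) ** beta
    adapted = rnnlm_probs * factors
    return adapted / adapted.sum()   # renormalize over the vocabulary
```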
Keywords/Search Tags: far-field speech recognition, deep neural network, beamforming, weighted prediction error, Wiener post-filter, language model, CHiME-5