
Research On Robust Speech Recognition Based On Deep Learning In Adverse Environment

Posted on: 2020-12-07
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y H Tu
Full Text: PDF
GTID: 1368330572978906
Subject: Information and communications systems
Abstract/Summary:
Since ancient times, speech has been the most common communication tool for human beings. People can express their inner emotions efficiently and conveniently through speech, so the progress of human society is inseparable from it. Speech recognition lets the machine "understand" what people are saying by converting the voice signal into text, so that the machine can respond to spoken commands. Speech recognition is therefore a window for human-computer interaction and plays a vital role in making machines intelligent. In today's society, with the rapid development of artificial intelligence technology, people's lives and ways of working have changed enormously. People are increasingly dissatisfied with human-computer interaction that relies on text and instructions entered through a keyboard and mouse, and are more inclined toward the convenient and quick medium of speech.

However, the generation, transmission, and collection of speech signals form a very complicated process. Speech is produced by the synergy of different human vocal organs, and because these organs differ from person to person, the spectral characteristics of speech from different speakers also differ greatly even for the same text content. In daily life, a microphone array is generally used to collect speech signals. Since the target speech, various environmental noises, and interfering speech all propagate as sound waves through the same medium, such as air, the target speech is often corrupted, and in a real adverse environment it may even be completely masked. This poses a huge challenge to the application of speech recognition systems in real-world scenarios.

According to the number of microphones, speech recognition can be divided into multi-channel and single-channel speech recognition. Compared with single-channel speech recognition, multi-channel speech recognition can exploit data from more channels and generally achieves better recognition performance. Aiming at the robustness of speech recognition in real adverse environments, this thesis studies robust speech recognition based on multi-channel and single-channel enhancement, respectively, from three aspects: robust features, speech enhancement, and robust acoustic modeling.

First, for multi-channel speech recognition, different robust features are designed for different scenes, so they are strongly complementary at the feature level. For example, in Chapter 3 we use pitch features (Ghahremani et al., 2014), speaker-adaptive features (Saon et al., 2013), and normalized features, concatenating them at the input of the acoustic model. This not only improves recognition performance but also saves model training and decoding time, since no separate acoustic model has to be trained for each feature type. Secondly, for speech enhancement, we propose a new noise estimation method to improve MVDR beamforming (a mask-based sketch is given below); experimental comparison shows that better speech enhancement can improve the robustness of the recognition system without acoustic model retraining. Finally, we study the fusion of different acoustic models.
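As a concrete illustration of the mask-based MVDR beamforming discussed above, here is a minimal sketch in Python, assuming a multichannel STFT of shape (F, T, C) and a time-frequency noise mask in [0, 1] (e.g., from a CGMM or a neural network). The function names and the eigenvector-based steering estimate are illustrative assumptions, not the thesis' exact noise-estimation method.

```python
# Minimal sketch of mask-based MVDR beamforming (illustrative; not the
# thesis' exact noise-estimation method).
import numpy as np

def weighted_cov(X, m):
    """Mask-weighted spatial covariance: sum_t m_t x_t x_t^H / sum_t m_t.

    X: (T, C) complex STFT frames at one frequency bin; m: (T,) mask weights.
    """
    R = np.einsum('t,ti,tj->ij', m, X, X.conj())
    return R / max(m.sum(), 1e-6)

def mvdr_enhance(stft, noise_mask):
    """stft: (F, T, C) noisy multichannel STFT; noise_mask: (F, T) in [0, 1]."""
    F, T, C = stft.shape
    out = np.zeros((F, T), dtype=complex)
    for f in range(F):
        X, m = stft[f], noise_mask[f]
        Rn = weighted_cov(X, m)          # noise covariance from the noise mask
        Rs = weighted_cov(X, 1.0 - m)    # speech covariance, complementary mask
        # Steering vector approximated by the principal eigenvector of Rs
        # (an assumption here; other steering estimates are possible).
        d = np.linalg.eigh(Rs)[1][:, -1]
        num = np.linalg.solve(Rn + 1e-6 * np.eye(C), d)
        w = num / (d.conj() @ num)       # w = Rn^{-1} d / (d^H Rn^{-1} d)
        out[f] = X @ w.conj()            # apply w^H x at every frame
    return out
```

A better noise mask tightens the estimate of the noise covariance Rn, which is exactly why improved noise estimation can help recognition without touching the acoustic model.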
Second, although the improved MVDR beamforming brings performance gains, the algorithm is only robust to stationary noise and handles non-stationary noise poorly. Therefore, for multi-channel speech enhancement, we propose a closed-loop front-end enhancement algorithm that combines a neural-network-based ideal ratio mask (IRM) with ASR-based voice activity detection (VAD) information. There are four major innovations. First, we use a neural-network-based IRM to improve the time-frequency masks estimated by CGMM-based methods. Second, the segmentation result from the recognizer is used as VAD information to improve the multi-channel beamforming algorithm. Third, the estimated IRM and the recognition-based VAD can in turn be updated from the better beamformed speech, so the deep-network IRM and the recognition-based VAD together form a closed-loop optimization. Finally, our approach combines the real-time estimation of CGMM-based methods, the powerful learning capability of deep networks, and feedback from recognition results to improve beamforming performance iteratively.

Finally, for the more difficult single-channel case, a novel teacher-student learning framework is proposed that combines classic single-channel speech enhancement, such as IMCRA, with data-driven deep-learning-based enhancement; the trained student model can then be used directly as a speech enhancement pre-processor for the ASR system in the recognition phase (see the sketch below). Experimental analysis shows that a traditional regression model cannot learn the nonlinear relationship between noisy speech features and the target IRM in a real adverse environment, and compared with unprocessed noisy speech it does not improve recognition performance. In contrast, the student model trained with the ISPP as its new learning target improves recognition accuracy without retraining the acoustic model. All experiments are validated on the CHiME challenge datasets, which focus on multi-channel and single-channel speech recognition in real adverse environments.
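To make the teacher-student framework concrete, here is a minimal sketch, assuming 257-bin log-power spectral input features and a teacher-provided mask-style target such as the ISPP; the network sizes, feature choice, and function names are illustrative assumptions, not the configuration used in the thesis.

```python
# Minimal teacher-student sketch (illustrative; layer sizes and the exact
# target definition are assumptions, not the thesis configuration).
import torch
import torch.nn as nn

class StudentEnhancer(nn.Module):
    """Feed-forward student: noisy log-power spectra -> mask-style target in [0, 1]."""
    def __init__(self, n_bins=257, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_bins, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_bins), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

def train_step(model, optimizer, noisy_feats, teacher_target):
    """One step of regression toward the teacher-provided target (e.g., ISPP)."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(noisy_feats), teacher_target)
    loss.backward()
    optimizer.step()
    return loss.item()

# At test time the student runs alone: its predicted mask is applied to the
# noisy spectrogram, and features for the ASR system are extracted from the
# enhanced signal, with no acoustic-model retraining.
```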
Keywords/Search Tags:robust speech recognition, speech enhancement, robust features, CHiME challenge, acoustic model, student-teacher model