
Multi-modal Speech Recognition Based On Deep Neural Network

Posted on: 2019-07-10  Degree: Master  Type: Thesis
Country: China  Candidate: X K Hu  Full Text: PDF
GTID: 2428330626952101  Subject: Computer technology
Abstract/Summary:
Speech recognition is a key technology for human-computer interaction and a driver of progress in artificial intelligence. Over the past few decades, many researchers have devoted considerable effort to this field and produced numerous technical advances. Today, automatic speech recognition systems are widely deployed in commercial products. In quiet conditions with near-field microphones, the accuracy of isolated-word recognition already exceeds the threshold for practical use, but large-vocabulary continuous speech recognition has reached a bottleneck. The rapid development of the Internet and multimedia technologies gives us access to large amounts of raw speech data and text corpora through many channels; however, traditional speech recognition algorithms struggle to exploit this data effectively to build a high-performance acoustic model, and recognition accuracy suffers accordingly. With the rise of deep learning in recent years, acoustic models based on the DNN-HMM hybrid system have replaced the traditional GMM-HMM acoustic model and become the dominant framework in speech recognition systems. At the same time, speech recognition in complex noise environments remains a research hotspot: speech information from a single modality is easily corrupted by environmental noise, which degrades the recognition results of the acoustic model. Visual information, by contrast, is unaffected by acoustic noise and can supplement the speech signal from the visual channel.

Against this background, this thesis proposes a speech recognition method based on audio-visual information fusion, which combines facial lip-image features with speech features to improve the robustness and accuracy of the acoustic model. First, a large-scale continuous Chinese corpus is designed, and a Kinect device is used to record speech and image data. Then, lip-image features and acoustic features of different dimensions are selected through experiments, and the multimodal features are fused. Finally, the DNN-HMM acoustic model is built, trained, and decoded on the Kaldi platform.

Experiments are conducted on a small-scale Chinese corpus recorded in the laboratory, comparing the traditional GMM-HMM and DNN-HMM acoustic models. The results show that the multimodal acoustic model based on a deep neural network effectively reduces the error rate on both word and sentence tasks.
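The abstract does not give implementation details of the fusion, so the following is only a minimal sketch of the general idea it describes: lip-image features are concatenated with frame-level acoustic features, and a feed-forward DNN maps the fused vector to posteriors over tied HMM states, as in a DNN-HMM hybrid. The framework (PyTorch), the early (concatenation-level) fusion, the layer sizes, and the feature dimensions (39-dim MFCC-style acoustic features, 32-dim lip features, 2000 states) are all illustrative assumptions, not the thesis's actual configuration.

```python
# Minimal sketch of audio-visual early fusion for a DNN-HMM acoustic model.
# All dimensions and layer sizes below are illustrative assumptions.
import torch
import torch.nn as nn

class FusionDNN(nn.Module):
    """Feed-forward DNN over concatenated audio + lip features."""
    def __init__(self, audio_dim=39, lip_dim=32, num_states=2000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + lip_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, num_states),  # scores over tied HMM states
        )

    def forward(self, audio_feats, lip_feats):
        # Early fusion: concatenate the two modalities per frame.
        fused = torch.cat([audio_feats, lip_feats], dim=-1)
        return self.net(fused)

# Example: a batch of 8 frames with the assumed feature dimensions.
model = FusionDNN()
audio = torch.randn(8, 39)   # e.g. 13 MFCCs + deltas + delta-deltas
lips = torch.randn(8, 32)    # e.g. coefficients extracted from the lip region
logits = model(audio, lips)  # shape (8, 2000): unnormalised state scores
```

In hybrid DNN-HMM decoding, the network's state posteriors are typically divided by the state priors to obtain scaled likelihoods, which then serve as the HMM emission scores during decoding (in this thesis, on the Kaldi platform).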
Keywords/Search Tags: Audio-Visual Speech Recognition, Acoustic Modeling, Deep Learning, Multi-modal Information