
Multi-modal Speech Recognition Based On Deep Neural Network

Posted on: 2019-07-10  Degree: Master  Type: Thesis
Country: China  Candidate: X K Hu  Full Text: PDF
GTID: 2428330626952101  Subject: Computer technology
Abstract/Summary:
Speech recognition is a key technology for human-computer interaction and a driver of progress in artificial intelligence. Over the past few decades, many researchers have devoted considerable effort to this field and produced numerous technical advances. Today, automatic speech recognition systems are widely deployed in commercial products. In quiet conditions with near-field microphones, the accuracy of isolated-word recognition already exceeds the threshold for practical use, but large-vocabulary continuous speech recognition has reached a bottleneck. The rapid development of the Internet and multimedia technologies gives us access to large amounts of raw speech data and text corpora through many channels; however, traditional speech recognition algorithms struggle to exploit this data effectively to build a high-performance acoustic model, and recognition accuracy suffers accordingly. With the rise of deep learning in recent years, acoustic models based on the DNN-HMM hybrid system have replaced the traditional GMM-HMM acoustic model and become the dominant framework in speech recognition systems. At the same time, speech recognition in complex noise environments remains a research hotspot: speech information from a single modality is easily corrupted by environmental noise, which degrades the recognition results of the acoustic model. Visual information, by contrast, is unaffected by acoustic noise and can supplement the speech signal from the visual channel.

Against this background, this thesis proposes a speech recognition method based on audio-visual information fusion, which combines facial lip-image features with speech features to improve the robustness and accuracy of the acoustic model. First, a large-scale continuous Chinese corpus is designed, and a Kinect device is used to record speech and image data. Then, lip-image features and acoustic features of different dimensions are selected through experiments, and the multimodal features are fused. Finally, the DNN-HMM acoustic model is built, trained, and decoded on the Kaldi platform.

Experiments are conducted on a small-scale Chinese corpus recorded in the laboratory, comparing the traditional GMM-HMM and DNN-HMM acoustic models. The results show that the multimodal acoustic model based on a deep neural network effectively reduces the error rate on both word and sentence tasks.
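The abstract does not give implementation details of the fusion, so the following is only a minimal sketch of the general idea it describes: lip-image features are concatenated with frame-level acoustic features, and a feed-forward DNN maps the fused vector to posteriors over tied HMM states, as in a DNN-HMM hybrid. The framework (PyTorch), the early (concatenation-level) fusion, the layer sizes, and the feature dimensions (39-dim MFCC-style acoustic features, 32-dim lip features, 2000 states) are all illustrative assumptions, not the thesis's actual configuration.

```python
# Minimal sketch of audio-visual early fusion for a DNN-HMM acoustic model.
# All dimensions and layer sizes below are illustrative assumptions.
import torch
import torch.nn as nn

class FusionDNN(nn.Module):
    """Feed-forward DNN over concatenated audio + lip features."""
    def __init__(self, audio_dim=39, lip_dim=32, num_states=2000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + lip_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, num_states),  # scores over tied HMM states
        )

    def forward(self, audio_feats, lip_feats):
        # Early fusion: concatenate the two modalities per frame.
        fused = torch.cat([audio_feats, lip_feats], dim=-1)
        return self.net(fused)

# Example: a batch of 8 frames with the assumed feature dimensions.
model = FusionDNN()
audio = torch.randn(8, 39)   # e.g. 13 MFCCs + deltas + delta-deltas
lips = torch.randn(8, 32)    # e.g. coefficients extracted from the lip region
logits = model(audio, lips)  # shape (8, 2000): unnormalised state scores
```

In hybrid DNN-HMM decoding, the network's state posteriors are typically divided by the state priors to obtain scaled likelihoods, which then serve as the HMM emission scores during decoding (in this thesis, on the Kaldi platform).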
Keywords/Search Tags: Audio-Visual Speech Recognition, Acoustic Modeling, Deep Learning, Multi-modal Information