Design And Implementation Of Multimodal Language Recognition System

Posted on: 2022-09-20
Degree: Master
Type: Thesis
Country: China
Candidate: J He
Full Text: PDF
GTID: 2518306605988559
Subject: Control Engineering
Abstract/Summary:
In recent years, speech recognition and image recognition have gradually become mainstream modes of human-computer interaction, and speech recognition has become a key driver of the development of artificial intelligence. Research on speech recognition against noisy backgrounds is also emerging. Although the recognition accuracy of isolated words has reached 99% in test environments, in practice speech is accompanied by background sound from the surrounding environment, so real-world recognition accuracy falls short of expectations. A new algorithmic model is urgently needed to overcome this problem. With the rapid development of deep learning, deep-learning-based Markov models have gradually become the mainstream speech recognition models, replacing the traditional Gaussian Markov model.

Against this background, and to further improve the accuracy of speech recognition, this thesis presents a multimodal language recognition system based on audio-visual fusion. Visual features from lip reading are added to traditional speech recognition, so that when the audio background is too noisy, lip movements supplement the understanding of semantics. The thesis consists of four main parts. First, audio feature parameters are extracted: the required FBank and MFCC features are obtained through the MFCC parameter extraction method. Second, video image features are extracted: the video is preprocessed by framing and windowing, and the visual features are then extracted. Third, the features are fused: a GMM-HMM model is trained as the baseline, then a convolutional neural network and a deep neural network are each used for feature fusion, modeling, and training on the visual and auditory information, and finally the recognition accuracy of the two network models is tested. Fourth, a program interface is built and the overall system is tested.
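The framing, windowing, and fusion steps named above can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names `frame_signal` and `fuse_features`, the Hamming window, and the frame-length and hop parameters are assumptions for illustration only. The abstract performs fusion with CNN and DNN models; the simple concatenation below only shows how audio and visual feature streams with different frame rates might be aligned before being passed to such a network, assuming both streams cover the same time span.

```python
import math


def frame_signal(samples, frame_len, hop):
    """Split a 1-D signal into overlapping frames and apply a Hamming window.

    frame_len and hop are in samples; the values are assumptions,
    since the abstract does not state them.
    """
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames


def fuse_features(audio_feats, visual_feats):
    """Feature-level fusion by concatenation (an illustrative stand-in).

    Audio features usually have a higher frame rate than video, so each
    audio frame is paired with the visual frame at the same relative
    position in time, and the two vectors are concatenated.
    """
    if not audio_feats or not visual_feats:
        raise ValueError("both feature streams must be non-empty")
    n_a, n_v = len(audio_feats), len(visual_feats)
    fused = []
    for i, a in enumerate(audio_feats):
        j = min(i * n_v // n_a, n_v - 1)  # index of the matching visual frame
        fused.append(list(a) + list(visual_feats[j]))
    return fused
```

In a fuller pipeline, each windowed audio frame would be converted to FBank or MFCC coefficients before fusion, and the fused vectors would serve as network input.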
Keywords/Search Tags: Multimodal language recognition, Image feature extraction, Audio feature extraction, GMM-HMM, Convolutional neural network, Deep neural network