
The Methods Of Deep Audio-visual Speech Recognition

Posted on: 2019-03-07
Degree: Master
Type: Thesis
Country: China
Candidate: C L Tian
Full Text: PDF
GTID: 2428330596956567
Subject: Information and Signal Processing
Abstract/Summary:
Speech recognition is a core problem of artificial intelligence and natural language processing, and it has advanced rapidly during the deep-learning boom of the last decade. Nevertheless, it still faces many challenges, such as noisy environments, interference, and confusion among dialects. To address these deficiencies, researchers have proposed robust speech recognition and pursued four families of solutions: feature-space, signal-space, model-space, and multimodal methods; the main direction within the multimodal family is audio-visual speech recognition (AVSR). AVSR exploits the natural correlation between visual and auditory information: visual information is added to the recognizer to improve the robustness of speech recognition. After several decades of research and exploration, AVSR has made great progress, but most systems still make incomplete use of temporal information and do not consider the relationships among the multimodal feature representations. Building on deep neural networks, this thesis discusses methods for deep AVSR and studies two aspects in particular.

First, addressing the incomplete use of temporal information in most AVSR systems, this thesis proposes a deep temporal architecture based on unsupervised and supervised learning. In particular, our work divides the fusion into three phases: modal fusion, temporal modal fusion, and temporal fusion, realized in four steps (illustrative code sketches follow this abstract):

1. The visual and speech signals are preprocessed, and visual and auditory features are extracted with a deep convolutional neural network and the short-time Fourier transform, respectively.
2. A multimodal deep autoencoder network performs modal fusion of the visual and auditory features.
3. A stacked recurrent neural network further fuses the features after modal fusion, explicitly taking temporal factors into account.
4. The multi-step temporal features are merged into a single feature using a recurrent network and a pooling step.

Quantitative evaluations of AVSR and cross-modality speech recognition on the AVLetters2, AVDigits, CUAVE, and AVLetters databases demonstrate the effectiveness of the proposed model.

Second, considering the relationships among the audio, visual, and audio-visual feature representations, this thesis presents an end-to-end fusion-and-recognition model based on a multimodal gated recurrent network and an auxiliary loss, with feature extraction and data augmentation as preconditions of fusion and recognition. The proposed auxiliary-loss multimodal GRU is trained with a new loss function, the auxiliary loss, which explicitly relates the audio, video, and audio-visual feature representations. Quantitative evaluations of AVSR and cross-modality speech recognition on the AVLetters2, AVDigits, CUAVE, and AVLetters databases demonstrate the effectiveness of the proposed algorithm and of the data augmentation.
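The abstract does not specify the layout of the multimodal deep autoencoder used for modal fusion (step 2). The following is a minimal PyTorch sketch of the idea, with all layer sizes and the single shared layer being assumptions for illustration, not the thesis's specification:

```python
import torch
import torch.nn as nn

class MultimodalAutoencoder(nn.Module):
    """Minimal bimodal deep autoencoder: separate encoders map visual and
    audio features into a shared space, a joint layer fuses them, and
    separate decoders reconstruct both modalities (hypothetical sizes)."""

    def __init__(self, dim_v=256, dim_a=128, dim_shared=128):
        super().__init__()
        self.enc_v = nn.Sequential(nn.Linear(dim_v, dim_shared), nn.ReLU())
        self.enc_a = nn.Sequential(nn.Linear(dim_a, dim_shared), nn.ReLU())
        self.fuse = nn.Linear(2 * dim_shared, dim_shared)  # joint (fused) layer
        self.dec_v = nn.Linear(dim_shared, dim_v)
        self.dec_a = nn.Linear(dim_shared, dim_a)

    def forward(self, v, a):
        h = self.fuse(torch.cat([self.enc_v(v), self.enc_a(a)], dim=-1))
        return self.dec_v(h), self.dec_a(h), h

# Unsupervised training minimizes reconstruction error of both modalities.
model = MultimodalAutoencoder()
v, a = torch.randn(32, 256), torch.randn(32, 128)
rec_v, rec_a, fused = model(v, a)
loss = nn.functional.mse_loss(rec_v, v) + nn.functional.mse_loss(rec_a, a)
```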
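Steps 3 and 4 can be read as a stacked recurrent network followed by temporal pooling. The sketch below assumes a two-layer GRU, mean pooling over time, and hypothetical dimensions; the thesis may use a different recurrent cell, pooling operator, or depth:

```python
import torch
import torch.nn as nn

class TemporalFusion(nn.Module):
    """Sketch of steps 3-4: a stacked recurrent network refines the
    per-frame fused features, then pooling merges the sequence into a
    single utterance-level feature (all sizes hypothetical)."""

    def __init__(self, dim_in=128, dim_hid=128, num_layers=2, num_classes=26):
        super().__init__()
        self.rnn = nn.GRU(dim_in, dim_hid, num_layers=num_layers,
                          batch_first=True)      # stacked recurrent fusion
        self.classifier = nn.Linear(dim_hid, num_classes)

    def forward(self, x):                        # x: (batch, time, dim_in)
        h, _ = self.rnn(x)                       # temporal modal fusion
        pooled = h.mean(dim=1)                   # temporal fusion by mean pooling
        return self.classifier(pooled)

seq = torch.randn(32, 40, 128)                  # fused features from the autoencoder
logits = TemporalFusion()(seq)                  # (32, 26), e.g. letter classes
```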
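The exact form of the auxiliary loss is given only in the full text. One plausible reading, sketched below, runs a GRU per modality plus a fused branch, and adds terms that pull each unimodal representation toward the fused audio-visual one. All module names, dimensions, and the weight alpha are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalGRU(nn.Module):
    """Sketch of an end-to-end multimodal GRU: one GRU per modality plus a
    fused branch; architecture details are assumptions, not the thesis spec."""

    def __init__(self, dim_a=128, dim_v=256, dim_h=128, num_classes=10):
        super().__init__()
        self.gru_a = nn.GRU(dim_a, dim_h, batch_first=True)
        self.gru_v = nn.GRU(dim_v, dim_h, batch_first=True)
        self.gru_av = nn.GRU(dim_a + dim_v, dim_h, batch_first=True)
        self.cls = nn.Linear(dim_h, num_classes)

    def forward(self, a, v):
        ha = self.gru_a(a)[0][:, -1]             # last-step audio representation
        hv = self.gru_v(v)[0][:, -1]             # last-step visual representation
        hav = self.gru_av(torch.cat([a, v], dim=-1))[0][:, -1]
        return ha, hv, hav, self.cls(hav)

def auxiliary_loss(ha, hv, hav, logits, target, alpha=0.1):
    """Main cross-entropy plus terms relating the unimodal representations
    to the fused one (one plausible reading of the thesis's auxiliary loss)."""
    ce = F.cross_entropy(logits, target)
    aux = F.mse_loss(ha, hav) + F.mse_loss(hv, hav)
    return ce + alpha * aux
```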
Keywords/Search Tags: Audio-visual speech recognition, Computer vision, Speech recognition, Deep neural networks