
Study on Cross-Modal Speech Recognition Methods Fusing Lip-Reading

Posted on: 2022-11-08    Degree: Master    Type: Thesis
Country: China    Candidate: Y W Gong    Full Text: PDF
GTID: 2518306752983869    Subject: Computer software and theory
Abstract/Summary:
Speech is the most direct way for humans to share information. With the development of deep learning in recent years, speech recognition has achieved great success and has moved from the laboratory into everyday life. However, its recognition accuracy remains limited in noisy acoustic environments. Research that fuses multiple modalities to achieve a higher recognition rate has therefore become one of the most active topics, and audio-visual bimodal speech recognition draws on both auditory and visual information. This thesis constructs a new audio-visual speech recognition framework from a bimodal perspective, with the following contributions:

1. Construction of the CCTV-LIP audio-visual dataset for visual speech recognition. Data is the foundation of this research. Most existing audio-visual datasets are in English, and open-source Chinese audio-visual datasets of medium size are scarce. To ease this shortage, this thesis builds a Chinese audio-visual dataset with 350 word classes. The data comes from CCTV news broadcasts, short videos, and the LRW-1000 dataset in a 5:3:2 ratio; it covers a variety of scenes with diverse content, is labeled semi-automatically, and is divided into training, validation, and test sets.

2. A visually guided audio attention network, VG-AVFN. In real noisy environments, humans usually judge what a speaker says by also watching the speaker's mouth movements, which are often more reliable than the voice. A multilayer-perceptron-based attention maps visual and auditory features into the same space and adaptively learns the audio regions associated with vision (a minimal sketch of this idea follows the abstract). VG-AVFN achieves better performance on both the LRW dataset and the CCTV-LIP dataset.

3. An improved cross-modal attention fusion network, VG-CMFSR. Most existing multimodal fusion methods simply concatenate features directly, which greatly limits recognition effectiveness. VG-CMFSR adds self-attention to the audio and visual branches to filter features within each modality and focus on the more important ones. It then applies interactive cross-modal attention fusion, obtaining audio contextual information with visual information as the query and visual contextual information with audio information as the query, adds residual connections to preserve the integrity of the original feature structure, and finally fuses the two context streams (see the second sketch below). Experiments show that its performance on the LRW and CCTV-LIP datasets clearly exceeds that of other fusion approaches and of unimodal approaches.
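The abstract describes the visually guided attention of VG-AVFN only at a high level. The following sketch, assuming a PyTorch implementation with illustrative layer sizes and the hypothetical module name VisuallyGuidedAttention, shows one way an MLP-based attention can project frame-level audio features and a clip-level visual feature into a shared space and re-weight the audio frames accordingly; it is not the thesis's exact design.

    # Minimal sketch (assumption: PyTorch, illustrative layer sizes) of an
    # MLP-based, visually guided attention over audio frames. Inputs are
    # frame-level audio features (B, T, Da) and a clip-level visual feature (B, Dv).
    import torch
    import torch.nn as nn

    class VisuallyGuidedAttention(nn.Module):
        def __init__(self, audio_dim: int, visual_dim: int, hidden_dim: int = 256):
            super().__init__()
            # Project both modalities into the same hidden space.
            self.audio_proj = nn.Linear(audio_dim, hidden_dim)
            self.visual_proj = nn.Linear(visual_dim, hidden_dim)
            # MLP that scores each audio frame conditioned on the visual cue.
            self.score_mlp = nn.Sequential(
                nn.Linear(hidden_dim, hidden_dim),
                nn.Tanh(),
                nn.Linear(hidden_dim, 1),
            )

        def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
            a = self.audio_proj(audio)                 # (B, T, H)
            v = self.visual_proj(visual).unsqueeze(1)  # (B, 1, H), broadcast over frames
            weights = torch.softmax(self.score_mlp(a + v), dim=1)  # (B, T, 1)
            # Re-weight audio frames so vision-consistent regions dominate.
            return (weights * audio).sum(dim=1)        # (B, Da)

    if __name__ == "__main__":
        attn = VisuallyGuidedAttention(audio_dim=512, visual_dim=512)
        print(attn(torch.randn(2, 75, 512), torch.randn(2, 512)).shape)  # torch.Size([2, 512])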
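The interactive fusion in VG-CMFSR can be sketched in the same spirit: self-attention inside each branch, cross-attention in both query directions, residual connections, and concatenation of the two context streams. Here nn.MultiheadAttention and all module names stand in for the thesis's actual attention blocks; the layer layout and the final concatenation are assumptions for illustration only.

    # Minimal sketch (assumption: PyTorch, nn.MultiheadAttention as a stand-in
    # for the thesis's attention blocks) of interactive cross-modal fusion with
    # per-modality self-attention, bidirectional cross-attention, and residuals.
    import torch
    import torch.nn as nn

    class CrossModalFusion(nn.Module):
        def __init__(self, dim: int = 512, heads: int = 8):
            super().__init__()
            self.audio_self = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.visual_self = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.audio_query = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.visual_query = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
            # audio, visual: (B, T, dim) frame-level features from the two branches.
            # Self-attention filters features within each modality (residual kept).
            a = audio + self.audio_self(audio, audio, audio)[0]
            v = visual + self.visual_self(visual, visual, visual)[0]
            # Visual contextual information with audio as the query, and audio
            # contextual information with visual as the query; each residual
            # preserves the original query-side features.
            vis_ctx = a + self.audio_query(a, v, v)[0]
            aud_ctx = v + self.visual_query(v, a, a)[0]
            # Fuse the two context streams; the downstream classifier is omitted.
            return torch.cat([aud_ctx, vis_ctx], dim=-1)  # (B, T, 2 * dim)

    if __name__ == "__main__":
        fusion = CrossModalFusion(dim=512, heads=8)
        print(fusion(torch.randn(2, 29, 512), torch.randn(2, 29, 512)).shape)  # torch.Size([2, 29, 1024])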
Keywords/Search Tags: Cross-modal fusion, Lip-reading, Audio-visual speech recognition, Speech recognition, Attention mechanism, Visually guided