
Study on Cross-Modal Speech Recognition Methods Fusing Lip-Reading

Posted on: 2022-11-08    Degree: Master    Type: Thesis
Country: China    Candidate: Y W Gong    Full Text: PDF
GTID: 2518306752983869    Subject: Computer software and theory
Abstract/Summary:
Speech is the most direct way for humans to share information. With the development of deep learning in recent years, speech recognition has achieved great success and has moved from the laboratory into everyday life. However, its recognition accuracy remains limited in noisy acoustic environments. Research that fuses multiple modalities to achieve a higher recognition rate has therefore become one of the most active topics, and audio-visual bimodal speech recognition draws on both auditory and visual information. This thesis constructs a new audio-visual speech recognition framework from a bimodal perspective, with the following contributions:

1. Construction of the CCTV-LIP audio-visual dataset for visual speech recognition. Data is the foundation of this research. Most existing audio-visual datasets are in English, and open-source Chinese audio-visual datasets of medium size are scarce. To ease this shortage, this thesis builds a Chinese audio-visual dataset with 350 word classes. The data comes from CCTV news broadcasts, short videos, and the LRW-1000 dataset in a 5:3:2 ratio; it covers a variety of scenes with diverse content, is labeled semi-automatically, and is divided into training, validation, and test sets.

2. A visually guided audio attention network, VG-AVFN. In real noisy environments, humans usually judge what a speaker says by also watching the speaker's mouth movements, which are often more reliable than the voice. A multilayer-perceptron-based attention maps visual and auditory features into the same space and adaptively learns the audio regions associated with vision (a minimal sketch of this idea follows the abstract). VG-AVFN achieves better performance on both the LRW dataset and the CCTV-LIP dataset.

3. An improved cross-modal attention fusion network, VG-CMFSR. Most existing multimodal fusion methods simply concatenate features directly, which greatly limits recognition effectiveness. VG-CMFSR adds self-attention to the audio and visual branches to filter features within each modality and focus on the more important ones. It then applies interactive cross-modal attention fusion, obtaining audio contextual information with visual information as the query and visual contextual information with audio information as the query, adds residual connections to preserve the integrity of the original feature structure, and finally fuses the two context streams (see the second sketch below). Experiments show that its performance on the LRW and CCTV-LIP datasets clearly exceeds that of other fusion approaches and of unimodal approaches.
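The abstract describes the visually guided attention of VG-AVFN only at a high level. The following sketch, assuming a PyTorch implementation with illustrative layer sizes and the hypothetical module name VisuallyGuidedAttention, shows one way an MLP-based attention can project frame-level audio features and a clip-level visual feature into a shared space and re-weight the audio frames accordingly; it is not the thesis's exact design.

    # Minimal sketch (assumption: PyTorch, illustrative layer sizes) of an
    # MLP-based, visually guided attention over audio frames. Inputs are
    # frame-level audio features (B, T, Da) and a clip-level visual feature (B, Dv).
    import torch
    import torch.nn as nn

    class VisuallyGuidedAttention(nn.Module):
        def __init__(self, audio_dim: int, visual_dim: int, hidden_dim: int = 256):
            super().__init__()
            # Project both modalities into the same hidden space.
            self.audio_proj = nn.Linear(audio_dim, hidden_dim)
            self.visual_proj = nn.Linear(visual_dim, hidden_dim)
            # MLP that scores each audio frame conditioned on the visual cue.
            self.score_mlp = nn.Sequential(
                nn.Linear(hidden_dim, hidden_dim),
                nn.Tanh(),
                nn.Linear(hidden_dim, 1),
            )

        def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
            a = self.audio_proj(audio)                 # (B, T, H)
            v = self.visual_proj(visual).unsqueeze(1)  # (B, 1, H), broadcast over frames
            weights = torch.softmax(self.score_mlp(a + v), dim=1)  # (B, T, 1)
            # Re-weight audio frames so vision-consistent regions dominate.
            return (weights * audio).sum(dim=1)        # (B, Da)

    if __name__ == "__main__":
        attn = VisuallyGuidedAttention(audio_dim=512, visual_dim=512)
        print(attn(torch.randn(2, 75, 512), torch.randn(2, 512)).shape)  # torch.Size([2, 512])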
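The interactive fusion in VG-CMFSR can be sketched in the same spirit: self-attention inside each branch, cross-attention in both query directions, residual connections, and concatenation of the two context streams. Here nn.MultiheadAttention and all module names stand in for the thesis's actual attention blocks; the layer layout and the final concatenation are assumptions for illustration only.

    # Minimal sketch (assumption: PyTorch, nn.MultiheadAttention as a stand-in
    # for the thesis's attention blocks) of interactive cross-modal fusion with
    # per-modality self-attention, bidirectional cross-attention, and residuals.
    import torch
    import torch.nn as nn

    class CrossModalFusion(nn.Module):
        def __init__(self, dim: int = 512, heads: int = 8):
            super().__init__()
            self.audio_self = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.visual_self = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.audio_query = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.visual_query = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
            # audio, visual: (B, T, dim) frame-level features from the two branches.
            # Self-attention filters features within each modality (residual kept).
            a = audio + self.audio_self(audio, audio, audio)[0]
            v = visual + self.visual_self(visual, visual, visual)[0]
            # Visual contextual information with audio as the query, and audio
            # contextual information with visual as the query; each residual
            # preserves the original query-side features.
            vis_ctx = a + self.audio_query(a, v, v)[0]
            aud_ctx = v + self.visual_query(v, a, a)[0]
            # Fuse the two context streams; the downstream classifier is omitted.
            return torch.cat([aud_ctx, vis_ctx], dim=-1)  # (B, T, 2 * dim)

    if __name__ == "__main__":
        fusion = CrossModalFusion(dim=512, heads=8)
        print(fusion(torch.randn(2, 29, 512), torch.randn(2, 29, 512)).shape)  # torch.Size([2, 29, 1024])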
Keywords/Search Tags: Cross-modal fusion, Lip-reading, Audio-visual speech recognition, Speech recognition, Attention mechanism, Visually guided