
A Study On Bimodal Audio Visual Speech Recognition Based On Deep Learning

Posted on: 2021-05-28
Degree: Master
Type: Thesis
Country: China
Candidate: J N Luo
Full Text: PDF
GTID: 2518306548481844
Subject: Computer technology
Abstract/Summary:
In recent years, with the rapid development of deep learning, many deep-learning-based topics have become active research areas, and modal recognition is one of them. It aims to classify single or multiple modal sequences, learn the correspondence between different modalities and their content, and finally output the result as text. The main modalities involved are auditory and visual. At present, bimodal (audio-visual) recognition is still under development, constrained by insufficient datasets, language diversity, and speaker habits.

Starting from the mathematical definition of pattern recognition, this thesis mathematically models the problem, constructs a bimodal audio-visual architecture, and proposes a new audio-visual recognition structure. Traditional models delegate the modeling of temporal dependence entirely to the back end, often ignoring differences in short-term dependence caused by varying speaker habits in in-the-wild datasets. The model proposed here strengthens the learning of short-term feature dependence, improves recognition performance, and also shows clear advantages in experimental efficiency: the parameter count of the visual-only model is reduced by nearly half.

This thesis uses two public datasets: the word-level dataset LRW and the sentence-level dataset GRID. By studying the characteristics of these two datasets, the audio-visual network is adapted to each, and two audio-visual recognition models are proposed. The model details and training methods are also elaborated, including shallow feature extraction, feature fusion, the CTC alignment mechanism, and a fine-tuning mechanism. The proposed models are validated on both datasets. On LRW, classification accuracy with lip-image sequences alone reaches 83.1%, and audio-visual accuracy reaches 98.16%, exceeding the previous best result by 0.16 percentage points. On GRID, compared with the original results, the word error rates under the two conditions decreased by 0.29% and 0.41%, respectively.
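The feature-fusion step mentioned above can be illustrated as frame-wise concatenation of audio and visual feature sequences before the back-end classifier. This is a minimal sketch under assumed settings: the feature names, dimensions, and the use of concatenation as the fusion operator are illustrative choices, not details taken from the thesis (the 29-frame clip length follows the standard LRW setup).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-frame front-end outputs (dimensions are illustrative):
T = 29                                        # frames per LRW word clip
audio_feat = rng.standard_normal((T, 256))    # audio front-end features
visual_feat = rng.standard_normal((T, 512))   # lip-image front-end features

# Feature fusion by frame-wise concatenation: each fused frame carries
# both modalities and is passed on to the temporal back end.
fused = np.concatenate([audio_feat, visual_feat], axis=1)

print(fused.shape)  # (29, 768)
```

In practice, the fused sequence would then feed a temporal model whose frame-wise outputs are aligned to the transcript with a CTC loss, as the abstract describes.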
Keywords/Search Tags: Deep Learning, Bimodal, Audio Visual Speech Recognition, Feature Fusion