
A Study On Bimodal Audio Visual Speech Recognition Based On Deep Learning

Posted on: 2021-05-28
Degree: Master
Type: Thesis
Country: China
Candidate: J N Luo
Full Text: PDF
GTID: 2518306548481844
Subject: Computer technology
Abstract/Summary:
In recent years, with the rapid development of deep learning, many deep-learning-based topics have become active research areas, and modal recognition is one of them. It aims to classify single or multiple modal sequences, learn the correspondence between different modalities and their content, and finally output the result as text. The main modalities involved are auditory and visual. At present, bimodal (audio-visual) recognition is still under development, constrained by insufficient datasets, language diversity, and speaker habits.

Starting from the mathematical definition of pattern recognition, this thesis mathematically models the problem, constructs a bimodal audio-visual architecture, and proposes a new audio-visual recognition structure. Traditional models delegate the modeling of temporal dependence entirely to the back end, often ignoring differences in short-term dependence caused by varying speaker habits in in-the-wild datasets. The model proposed here strengthens the learning of short-term feature dependence, improves recognition performance, and also shows clear advantages in experimental efficiency: the parameter count of the visual-only model is reduced by nearly half.

This thesis uses two public datasets: the word-level dataset LRW and the sentence-level dataset GRID. By studying the characteristics of these two datasets, the audio-visual network is adapted to each, and two audio-visual recognition models are proposed. The model details and training methods are also elaborated, including shallow feature extraction, feature fusion, the CTC alignment mechanism, and a fine-tuning mechanism. The proposed models are validated on both datasets. On LRW, classification accuracy with lip-image sequences alone reaches 83.1%, and audio-visual accuracy reaches 98.16%, exceeding the previous best result by 0.16 percentage points. On GRID, compared with the original results, the word error rates under the two conditions decreased by 0.29% and 0.41%, respectively.
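The feature-fusion step mentioned above can be illustrated as frame-wise concatenation of audio and visual feature sequences before the back-end classifier. This is a minimal sketch under assumed settings: the feature names, dimensions, and the use of concatenation as the fusion operator are illustrative choices, not details taken from the thesis (the 29-frame clip length follows the standard LRW setup).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-frame front-end outputs (dimensions are illustrative):
T = 29                                        # frames per LRW word clip
audio_feat = rng.standard_normal((T, 256))    # audio front-end features
visual_feat = rng.standard_normal((T, 512))   # lip-image front-end features

# Feature fusion by frame-wise concatenation: each fused frame carries
# both modalities and is passed on to the temporal back end.
fused = np.concatenate([audio_feat, visual_feat], axis=1)

print(fused.shape)  # (29, 768)
```

In practice, the fused sequence would then feed a temporal model whose frame-wise outputs are aligned to the transcript with a CTC loss, as the abstract describes.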
Keywords/Search Tags: Deep Learning, Bimodal, Audio Visual Speech Recognition, Feature Fusion