
Research On Continuous Automatic Lip-reading Technology Based On Deep Learning

Posted on: 2023-09-26    Degree: Master    Type: Thesis
Country: China    Candidate: K Li    Full Text: PDF
GTID: 2558307061461174    Subject: Signal and Information Processing
Abstract/Summary:
Language is one of the most important media of information transmission in human social communication. Although most people rely on auditory cues to understand language, visual cues also convey a great deal of information. Automatic Lip-Reading (ALR) is a means of understanding human language using visual cues alone. With the development of deep learning and the availability of large-scale lip-reading datasets, more and more researchers are focusing on ALR using purely visual information. ALR technology has a wide range of applications in human-computer interaction, security verification, and public safety, and is therefore of great research value. However, most current research focuses on word-level tasks or requires extra data for training, and cannot perform ALR effectively on continuous utterances, leaving the technology some way from practical use. The main focus of this thesis is therefore to design a deep neural network model for continuous ALR that mimics the human lip-reading ability. The main work and innovations of this thesis are as follows:

First, this thesis designs an image feature extraction module based on a dual spatial-channel attention mechanism and a Deep Residual Network with a 3D head convolution (3D-ResNet), which extracts better features from lip-region images. This module is combined with a Bidirectional Gated Recurrent Unit (Bi-GRU) network with a temporal attention mechanism for feature reduction, forming an ALR architecture for isolated words. The model was trained and evaluated on two word-level lip-reading datasets, LRW and LRW-1000, and achieved results competitive with recent major work.

Second, building on the image feature extraction module above, this thesis designs two architectures for sentence-level ALR using different temporal modelling approaches. One is based on Bi-GRU, with decoding and loss computation via the Connectionist Temporal Classification (CTC) algorithm; the other uses a self-attention encoder-decoder structure for temporal modelling, with beam-search decoding and a label-smoothed KL-divergence loss. The two architectures were trained and compared on two major sentence-level lip-reading datasets, GRID and CMLR, using improved training strategies such as isolated-word pre-training, learning-rate warm-up, and incrementally increasing the training sequence length. Compared to the baseline, the self-attention encoder-decoder model reduced the character error rate (CER) on the CMLR dataset by 1.21%.

Third, this thesis designs an optical flow-based lip activity detection algorithm, which locates utterance breaks by thresholding the optical-flow change between adjacent frames. This segments the lip-reading video stream that is continuously fed into the model into separate utterances, and, working with the trained sentence-level ALR model, finally achieves continuous automatic lip-reading.
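The utterance-segmentation idea in the third contribution can be sketched in a few lines: score the motion between adjacent frames, mark transitions whose score exceeds a threshold as lip activity, and cut the stream at the inactive gaps. The sketch below is illustrative only, not the thesis implementation: it uses the mean absolute intensity change between frames as a cheap stand-in for the mean optical-flow magnitude the thesis computes, and the function names, `threshold`, and `min_len` parameters are hypothetical.

```python
import numpy as np

def motion_scores(frames):
    """Per-transition motion score: mean absolute intensity change
    between adjacent frames (a simple stand-in for the mean
    optical-flow magnitude used in the thesis)."""
    frames = np.asarray(frames, dtype=np.float32)
    return np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))

def segment_utterances(frames, threshold, min_len=2):
    """Split a lip-region frame stream into utterance segments.

    A frame transition is 'active' when its motion score exceeds
    `threshold`; active runs shorter than `min_len` transitions are
    discarded as noise. Returns half-open (start, end) frame indices."""
    active = motion_scores(frames) > threshold
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i                      # activity begins at frame i
        elif not a and start is not None:
            if i - start >= min_len:
                segments.append((start, i + 1))  # include last moving frame
            start = None
    if start is not None and len(active) - start >= min_len:
        segments.append((start, len(frames)))    # activity runs to the end
    return segments

# Toy stream: 4 static frames, 4 frames with changing intensity, 4 static.
frames = np.zeros((12, 4, 4), dtype=np.float32)
for t in range(4, 8):
    frames[t] = t
print(segment_utterances(frames, 0.5))
```

Each detected segment would then be cropped out and passed to the trained sentence-level ALR model; in practice a dense optical-flow estimator restricted to the lip region replaces the frame-difference score.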
Keywords/Search Tags:Deep Learning, Automatic Lip-Reading, Deep Residual Network, Attention Mechanism, Encoder-Decoder