
Tibetan Lip Recognition Based On Deep Learning

Posted on: 2022-05-05  Degree: Master  Type: Thesis
Country: China  Candidate: H Zeng  Full Text: PDF
GTID: 2518306500456344  Subject: Master of Engineering
Abstract/Summary:
Lip reading draws on image processing, natural language processing and related fields, and has become a research hotspot in computer vision. By modeling the temporal and spatial information in a continuous sequence of lip images, a model can extract features and learn the corresponding text content. Lip reading is useful in many scenarios: helping hearing-impaired people communicate, supporting speech recognition in noisy environments, and assisting military and public security investigations. Accurate lip reading, however, must overcome many difficulties. Differences in speaker posture and lighting conditions affect recognition, and different spoken content produces varied lip movements, which makes recognition harder still. As a result, research progress on lip reading has been slow.

Deep learning research depends on data, yet few lip-reading data sets are publicly available, and most of them are in English. This thesis first reviews the current mainstream lip-reading data sets. China is a multi-ethnic country with many languages, and to lay a foundation for Tibetan lip reading, this thesis constructs the first Tibetan word-level lip-reading data set, named TLRW-50. The data set covers 50 classes of commonly used Tibetan words, which are stored as sequences of lip images after preprocessing. Six image augmentation methods (color enhancement, added Gaussian noise, horizontal mirroring, zooming in, rotation, and cropping) are used to expand the lip image sequences. Before augmentation, the lip-reading videos were evaluated subjectively.

In view of the difficulties of lip reading, the D3D algorithm is applied to Tibetan lip reading. The model improves the feature extractor by replacing the spatial convolutions in DenseNet with spatio-temporal convolutions, which strengthens its ability to model short-term dependencies. Connectionist temporal classification (CTC) loss is adopted so that the network learns the input-output alignment without manual supervision, allowing the input and output sequences to be aligned during decoding. In extensive experiments, the method reaches top-1 and top-5 classification accuracies of 34.28% and 50.26% on LRW-1000, and 39.65% and 56.73% on TLRW-50, which shows that it can be used for Tibetan lip reading.

A deformation flow network is then used to capture facial motion: it is trained in a self-supervised manner to generate the deformation flow of the face in Tibetan lip-reading videos. To improve lip-reading performance, the deformation flow and the original video are fed to the two branches of a two-stream network, and each branch independently predicts the probability of each word class. To exchange information between the two branches during training, a two-way knowledge distillation loss is used so that each branch learns from the other's predictions. At test time, the predictions of the two branches are merged to make the final prediction. The fused result achieves higher classification accuracy than either branch alone, which shows that the two input sources, the original video and the deformation flow, provide complementary cues for the lip-reading task.
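The two-way distillation and the test-time fusion described above can be illustrated with a minimal Python sketch. The abstract does not give the exact form of the loss or the fusion rule, so everything here is an assumption made for illustration: the symmetric KL-divergence loss, the `temperature` parameter, the averaging of the two branches' probabilities, and all function names are hypothetical and not taken from the thesis.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def two_way_distillation_loss(logits_video, logits_flow, temperature=2.0):
    """Assumed form: sum of the KL divergences in both directions.

    Each term pulls one branch toward the other's softened prediction.
    In a real training loop the 'teacher' side of each term would usually
    be treated as a fixed target; that detail is omitted here.
    """
    p_video = softmax([x / temperature for x in logits_video])
    p_flow = softmax([x / temperature for x in logits_flow])
    return kl_divergence(p_flow, p_video) + kl_divergence(p_video, p_flow)

def fuse_predictions(logits_video, logits_flow):
    """Assumed fusion rule: average the two branches' class probabilities."""
    p_video = softmax(logits_video)
    p_flow = softmax(logits_flow)
    return [(a + b) / 2 for a, b in zip(p_video, p_flow)]

def top_k(probs, k):
    """Indices of the k most probable classes, best first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

if __name__ == "__main__":
    # Toy scores for a 3-class problem; the branches disagree on purpose.
    video_logits = [2.0, 0.5, 0.1]
    flow_logits = [0.2, 0.1, 3.0]
    print("distillation loss:", two_way_distillation_loss(video_logits, flow_logits))
    fused = fuse_predictions(video_logits, flow_logits)
    print("fused probabilities:", fused)
    print("top-2 classes:", top_k(fused, 2))
```

In this sketch, a sample counts as a top-k hit when its true class index appears in `top_k(fused, k)`; the top-1 and top-5 accuracies quoted above are the fraction of test samples that are hits for k = 1 and k = 5.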
Keywords/Search Tags:Tibetan, Lip recognition, Lip reading dataset, Data expansion, Deep learning, Deformable flow network