
Lip-reading Recognition Based On Spatio-Temporal Convolution And Bidirectional GRU

Posted on: 2020-09-13  Degree: Master  Type: Thesis
Country: China  Candidate: Y F Shen  Full Text: PDF
GTID: 2428330590477043  Subject: Computer application technology
Abstract/Summary:
Lipreading is a technique for recognizing speech content solely from the visual information of a speaker's lip movements. It has been widely applied to lip-based interactive control, silent message input, speech recognition in noisy environments, and silent video recognition. It is also of great significance for research on auxiliary authentication and public security, as well as for helping deaf people communicate. However, lipreading is a very difficult task for humans, and traditional machine learning methods and models are time-consuming when extracting lip-movement features and achieve poor recognition performance. Moreover, very few Chinese lipreading datasets are available; with limited data, the application value is also limited. To address these problems in Chinese lipreading, the main idea of this thesis is to build a large Chinese lipreading dataset, use multiple spatio-temporal CNNs (multi-STCNNs) and multiple bidirectional GRUs (multi-Bi-GRUs) to extract lip features, and train the model end-to-end, so as to implement sentence-level Chinese lipreading. The main research contents and contributions of this thesis are as follows:

(1) First, this thesis implements a client named 'Lipreading Video' on the iOS system to construct the Chinese lipreading dataset. The client allows different users to record lip-language video data: a user records a lip-language video matching the text displayed by the client and can then review, re-record, or upload the video to the server. The system applies a voice activity detection (VAD) algorithm to detect and segment the collected lip-language video data, automatically marking the start and end timestamps of each word spoken. It then uses an AdaBoost cascade classifier based on Haar-like features to detect and locate the face and extract the lip region (see the extraction sketch below). This scheme labels lip-language video data in batches and saves a great deal of manual labor.

(2) This thesis proposes the 'ChineseLipNet' model, an end-to-end model based on multi-STCNNs and multi-Bi-GRUs (see the model sketch below). For the input lip-language video data, the multi-STCNNs first extract features and the max-pooling layers reduce their dimensionality; these steps extract good features without any manual annotation. The multi-Bi-GRUs then process these features and learn to predict or recognize the sequences; being bidirectional, they allow the model to use information from the current time-step as well as future time-steps. Finally, a fully connected layer and a softmax layer produce the output prediction. This thesis evaluates the ChineseLipNet model and compares it with human lipreaders, the AlexNet model, and the VGG model. The experimental results show that the accuracy of ChineseLipNet is significantly higher than that of human lipreading and better than that of AlexNet and VGG. At the same time, ChineseLipNet has fewer parameters, a shorter training time, and faster convergence. Therefore, ChineseLipNet is not only suitable for training on large-scale lipreading datasets but is also well suited to deployment on portable terminal devices for recognition, giving it higher application value.
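As an illustration of the lip-region extraction step in contribution (1), the following is a minimal Python/OpenCV sketch that uses the library's bundled frontal-face Haar cascade (an AdaBoost cascade over Haar-like features). The crop ratios and output size are illustrative assumptions, not the values used in the thesis.

```python
# Minimal sketch of Haar-cascade face detection plus a heuristic lip crop.
# Assumes OpenCV's bundled frontal-face cascade; crop ratios are illustrative.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def extract_lip_region(frame):
    """Detect the largest face in a BGR frame and crop its lower third as the lip region."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda box: box[2] * box[3])  # keep the largest face
    # The mouth lies roughly in the lower third and middle half of the face box.
    lip = frame[y + 2 * h // 3 : y + h, x + w // 4 : x + 3 * w // 4]
    return cv2.resize(lip, (100, 50))  # fixed-size crop for the downstream model
```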
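Similarly, the following is a minimal PyTorch sketch of the kind of architecture described in contribution (2): stacked spatio-temporal (3D) convolutions with max-pooling, followed by bidirectional GRUs and a fully connected classifier. The layer counts, channel widths, hidden size, and vocabulary size are illustrative assumptions rather than the thesis's actual ChineseLipNet configuration.

```python
# Sketch of an STCNN + Bi-GRU lipreading model; all hyperparameters are assumed.
import torch
import torch.nn as nn

class LipNetLikeModel(nn.Module):
    def __init__(self, vocab_size=30, hidden=256):
        super().__init__()
        # Spatio-temporal feature extractor: Conv3d over (time, height, width).
        self.stcnn = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
        )
        # Bidirectional GRUs use both past and future time-steps.
        self.gru = nn.GRU(input_size=64 * 12 * 25, hidden_size=hidden,
                          num_layers=2, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, vocab_size)

    def forward(self, x):              # x: (batch, 3, frames, 50, 100)
        feats = self.stcnn(x)          # (batch, 64, frames, 12, 25)
        b, c, t, h, w = feats.shape
        feats = feats.permute(0, 2, 1, 3, 4).reshape(b, t, c * h * w)
        out, _ = self.gru(feats)       # (batch, frames, 2 * hidden)
        return self.fc(out)            # per-frame logits; softmax applied in the loss
```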
Keywords/Search Tags: Deep learning, lip recognition, Chinese lipreading, STCNN, Bi-GRU