| Lip recognition is the process by which a computer learns the changes in a dynamic image sequence of a speaker’s lips in order to recognize what the speaker is saying. Lip recognition technology is widely adopted in fields such as national public safety and medical correction. Most current research uses deep learning for lip recognition; however, to achieve high recognition rates, the network models are becoming ever larger, which makes them difficult to deploy on mobile devices. To address this problem, this thesis optimizes and improves the traditional MobileNet network, proposes a more lightweight FD-MobileNet network, and combines FD-MobileNet with a GRU network to recognize the two-dimensional image features and temporal features of the lips. To further improve the recognition rate, the attention mechanism is incorporated into the GRU network, and the effectiveness of the model is demonstrated through extensive experiments. Finally, we design an application system with a user interface so that lip recognition can be used in real life. The main research contents are as follows: (1) Studying lip-movement video sequences. A semi-random frame extraction algorithm is designed to extract frames from the lip-movement video; the 68 facial key points provided by the Dlib library are then used to locate the face in the extracted images; finally, the lip region is segmented using the geometric positions of the lip feature points. This preprocessing not only facilitates subsequent feature extraction by the neural network but also greatly reduces redundant information in the images. (2) Optimizing the MobileNet network model. By analyzing and comparing the lightweight network models MobileNet and ShuffleNet, we see that the basic network module of MobileNet is simple, which gives it fast prediction speed, while the fast downsampling strategy adopted by ShuffleNet can learn more
information with less computational cost. Therefore, this thesis optimizes the network structure of MobileNet and proposes a new network model that takes both computational cost and prediction speed into account: FD-MobileNet. Experimental comparison shows that FD-MobileNet is more accurate than MobileNet and is substantially faster than ShuffleNet in actual prediction time. (3) Constructing a lip recognition model that combines the FD-MobileNet and GRU networks. FD-MobileNet extracts the two-dimensional features of images, and the GRU network learns the changes in motion across the sequence. Drawing on the strengths of both networks, we propose a combination of FD-MobileNet and GRU for lip recognition. To learn the image features more accurately, the attention mechanism is incorporated into the GRU network. Six indexes are then used to evaluate the model: the loss function, test-set accuracy, the performance of common models on a self-made dataset, the word confusion matrix, the recall rate, and lip variability. The results show that the proposed model has strong intra-class similarity and inter-class variability and performs the video lip recognition task well. In terms of performance, the fast downsampling strategy and the attention mechanism are found to reduce redundant information in the video and to meet the requirement of noise suppression. (4) Developing a lip recognition application system. Considering real-life needs, we design a lip recognition system with a user interface that contains three modules: video selection, visual display, and recognition results. The system not only brings lip recognition into practical use but also gives researchers a tool for subsequently improving and optimizing the model. |
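
As a rough illustration of why a depthwise separable module and early downsampling both lower computational cost, the sketch below counts multiply-adds for a single convolution layer. The layer sizes (3x3 kernel, 64 to 128 channels, 56x56 and 28x28 feature maps) are hypothetical values chosen for the arithmetic only; they are not the FD-MobileNet configuration, which the abstract does not specify.

```python
def standard_conv_madds(k, c_in, c_out, h_out, w_out):
    """Multiply-adds of a standard k x k convolution."""
    return k * k * c_in * c_out * h_out * w_out


def depthwise_separable_madds(k, c_in, c_out, h_out, w_out):
    """Multiply-adds of a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in * h_out * w_out
    pointwise = c_in * c_out * h_out * w_out
    return depthwise + pointwise


# Hypothetical layer (assumption, not taken from the thesis).
K, C_IN, C_OUT = 3, 64, 128

# The same layer on a 56x56 output map, and on a 28x28 map, i.e. after
# one additional stride-2 downsampling step earlier in the network.
std_56 = standard_conv_madds(K, C_IN, C_OUT, 56, 56)
sep_56 = depthwise_separable_madds(K, C_IN, C_OUT, 56, 56)
sep_28 = depthwise_separable_madds(K, C_IN, C_OUT, 28, 28)

# Separable vs. standard: the ratio equals 1/C_OUT + 1/K^2.
print(f"separable / standard at 56x56: {sep_56 / std_56:.4f}")
# Halving the spatial resolution divides the cost of this layer by four.
print(f"28x28 / 56x56 (separable):     {sep_28 / sep_56:.4f}")
```

Because the cost of every layer scales with the area of its feature map, reducing the resolution early shrinks the cost of all later layers, which is the general idea behind a fast downsampling strategy.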