Lip reading is an important research direction in video understanding within computer vision; its goal is to recognize what a speaker is saying from the dynamic visual changes of the lips. Traditional image-processing approaches to lip reading require extensive hand-crafted design and empirical tuning for each scene, which limits their applicability and makes a unified pipeline difficult to achieve. With the development of deep learning, vision-based automatic lip reading can separate the background from the target foreground across different scenes and follow a highly unified model and technical route. By capturing lip movements, a user can analyze in real time what the target person is expressing. Deep learning avoids the complicated and tedious image processing, hard-to-train classifiers, and experience-dependent feature extraction of traditional methods, so the same method and idea can be applied across different scenarios.

This paper focuses on applying deep learning to lip reading. A neural network architecture that combines a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN) with an attention mechanism is proposed and applied in a lip-reading system. The specific work in the system is as follows:

(1) Preprocessing of the original video, including frame extraction and mouth localization and segmentation. First, an independently designed fixed-frame-extraction method converts the video into an image sequence. Then, using the 68 facial key points provided by the Dlib toolkit, the key lip coordinates are detected directly, and the lip region of interest is segmented from the extrema of these landmarks in the four directions. This completes the preprocessing from video to mouth image sequence. (A minimal illustrative sketch of this step is given after the abstract.)

(2) Image feature extraction with a lightweight CNN. The lightweight MobileNet network commonly used in industry, together with parameters pre-trained on ImageNet, is used to extract lip image features. The last convolutional stage of the feature-extraction network is modified by adding Global Average Pooling to complete dimension reduction and feature extraction. This dimension reduction is efficient, compresses features significantly, and is robust. (A sketch of this frame encoder follows the abstract.)

(3) Sequence context feature extraction with an RNN, where attention weights assigned to redundant information in the sequence effectively reduce its negative impact on the recognition result. After the CNN extracts features from each frame, the per-frame features are stitched into a sequence and fed to the RNN to learn the sequence context. The principle of the encoder-decoder attention mechanism is studied in depth, improved, and successfully applied to the fused CNN-RNN network. Furthermore, a controlled-variable comparison against a plain CNN-RNN model verifies that the proposed model suppresses redundant information in real recognition videos and improves performance. (A sketch of the attention module follows the abstract.)

(4) Design and implementation of the lip-reading system. The CNN for image feature extraction and the RNN for sequence modelling are combined, together with the encoder-decoder attention mechanism, and jointly applied to lip reading to build the lip-reading system. (An end-to-end sketch of the fused model follows the abstract.)
Verification and testing are completed on a laboratory-made video dataset. The experimental results show that the fused neural network model proposed in this paper performs well in practical applications, and the lip-reading system designed and implemented alongside it runs well on a personal computer.
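
The following is a minimal sketch of the preprocessing in step (1), assuming OpenCV and Dlib with the standard 68-landmark model file; the frame count of 25, the 10-pixel margin, and the 112x112 output size are illustrative assumptions, not values taken from the paper.

```python
# Step (1) sketch: fixed-count frame sampling and Dlib-based lip ROI cropping.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# Standard 68-landmark model shipped with Dlib (assumed to be available locally).
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def sample_frames(video_path, num_frames=25):
    """Sample a fixed number of frames at evenly spaced positions in the video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

def crop_lip_roi(frame, margin=10, out_size=112):
    """Detect the 68 facial landmarks and crop the lip region (points 48-67)
    using the extrema of the lip coordinates in the four directions."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)])
    x1, y1 = pts.min(axis=0) - margin
    x2, y2 = pts.max(axis=0) + margin
    roi = frame[max(y1, 0):y2, max(x1, 0):x2]
    return cv2.resize(roi, (out_size, out_size))
```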
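A possible realization of step (2) is sketched below, using torchvision's MobileNetV2 pretrained on ImageNet as a stand-in for the MobileNet named in the text, with Global Average Pooling applied after the last convolutional stage; the input resolution and clip length are illustrative.

```python
# Step (2) sketch: per-frame feature extraction with ImageNet-pretrained MobileNetV2 + GAP.
import torch
import torch.nn as nn
from torchvision import models

class LipFrameEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)
        self.features = backbone.features      # convolutional feature extractor
        self.gap = nn.AdaptiveAvgPool2d(1)      # global average pooling for dimension reduction

    def forward(self, x):                       # x: (batch, 3, H, W) normalized lip-ROI frames
        x = self.features(x)                    # (batch, 1280, h, w)
        return self.gap(x).flatten(1)           # (batch, 1280) compact frame descriptor

# Usage: encode each sampled frame, then stack the results into a feature sequence for the RNN.
encoder = LipFrameEncoder().eval()
with torch.no_grad():
    clip = torch.randn(25, 3, 112, 112)         # 25 lip-ROI frames (illustrative shapes)
    frame_features = encoder(clip)              # (25, 1280)
```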
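The attention in step (3) could take the form of the soft temporal attention sketched below, which scores each time step of the RNN output and down-weights redundant frames before classification; the hidden size and the additive scoring function are assumptions rather than the paper's exact design.

```python
# Step (3) sketch: soft attention over the RNN's per-frame outputs.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Score each time step and return an attention-weighted summary vector."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, rnn_out):                               # rnn_out: (batch, time, hidden_dim)
        weights = torch.softmax(self.score(rnn_out), dim=1)   # (batch, time, 1), sums to 1 over time
        context = (weights * rnn_out).sum(dim=1)              # (batch, hidden_dim) weighted summary
        return context, weights
```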
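Finally, an end-to-end sketch of the fused model in step (4): the frame encoder from the second sketch, a GRU over the frame feature sequence, and the attention module from the third sketch, followed by a classifier. The GRU choice, hidden size, and number of classes are illustrative assumptions, and `LipFrameEncoder` and `TemporalAttention` refer to the sketches above.

```python
# Step (4) sketch: CNN-RNN fusion with attention for word-level lip reading.
import torch
import torch.nn as nn

class LipReadingNet(nn.Module):
    def __init__(self, feature_dim=1280, hidden_dim=256, num_classes=10):
        super().__init__()
        self.frame_encoder = LipFrameEncoder()                 # per-frame CNN features (sketch 2)
        self.rnn = nn.GRU(feature_dim, hidden_dim, batch_first=True)
        self.attention = TemporalAttention(hidden_dim)         # temporal attention (sketch 3)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, clips):                                  # clips: (batch, time, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.frame_encoder(clips.flatten(0, 1))        # (batch*time, feature_dim)
        feats = feats.view(b, t, -1)                           # (batch, time, feature_dim)
        rnn_out, _ = self.rnn(feats)                           # (batch, time, hidden_dim)
        context, _ = self.attention(rnn_out)                   # (batch, hidden_dim)
        return self.classifier(context)                        # word logits
```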