| In recent years, research and development in deep learning have produced remarkable achievements in many fields, such as computer vision. As one of the most challenging tasks in machine vision, automatic visual speech recognition aims to identify what a speaker is saying solely by observing the speaker's lip movements, and it has advanced further with the boom in deep learning. Even so, many problems in visual speech recognition remain unsolved, owing to factors such as the subtlety of lip movements and the richness of language. Lip reading is a data-driven research area, and deep learning algorithms depend on large amounts of data for training and evaluation. However, because data sources are scarce, current lip-reading datasets, especially Chinese corpora, cannot support all aspects of lip-reading research. Although large datasets from the Chinese Academy of Sciences and Zhejiang University are currently available for study, a lip-reading dataset of everyday expressions is still lacking. In view of this, the first task of this project is to construct a large dataset of spoken Chinese that can be used for deep learning research, and to propose an integrated strategy and algorithmic pipeline for data selection and data processing. In addition, starting from the difficulties of lip reading and the blind spots overlooked in current studies, a new lip-reading model is proposed. The model consists mainly of two modules, the frontend and the backend, which correspond to two of the research focuses and innovations of this work. In the frontend module, group convolution is used to make the frontend architecture widely adopted in lip-reading research more lightweight, and dilated convolution is used to learn multi-scale lip features, so as to enhance the robustness of the model in the face of changes in lip resolution. The backend module combines a Recurrent Neural Network with an attention mechanism. By combining internal and external attention, the model can focus not only on the correspondence between the input and the output, but also on the internal structural correlations of the sentence itself, which to a certain extent relieves the poor performance on very short phrases caused by weak context. In addition, to the best of our knowledge, this paper is the first to analyze the effect of sentence length on recognition performance in automatic sentence-level lip reading of Mandarin. To verify the generality of the proposed model across lip-reading datasets, in addition to the experiments on the Mandarin dataset built in this work, several large, publicly available in-the-wild English datasets are used for experiments and for comparative analysis against other algorithms. Good results are obtained on these English datasets: on the LRW dataset the classification accuracy is 85.12%, which is 2.12% higher than the current best result, and performance on LRS2 and LRS3 is close to the current best without using a language model. At the same time, to better address the slow convergence and tendency to overfit of lip-reading models, four training strategies are designed to achieve more efficient training and better generalization. |
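
The two frontend ideas mentioned above, group convolution for a lighter model and dilated convolution for multi-scale features, can be illustrated with a minimal sketch. The channel counts, kernel sizes, group counts, and function names below are hypothetical values chosen for illustration; they are not taken from the proposed model.

```python
def conv2d_params(c_in, c_out, k, groups=1):
    """Number of weights in a k x k 2D convolution (bias ignored).

    With `groups` > 1, each output channel only sees c_in / groups
    input channels, so the weight count shrinks by that factor.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return (c_in // groups) * c_out * k * k


def dilated_span(k, dilation):
    """Input positions covered by one k-tap kernel with the given dilation."""
    return dilation * (k - 1) + 1


def dilated_conv1d(x, w, dilation=1):
    """Valid (no padding) 1D dilated cross-correlation of x with kernel w."""
    span = dilated_span(len(w), dilation)
    return [
        sum(w[j] * x[i + j * dilation] for j in range(len(w)))
        for i in range(len(x) - span + 1)
    ]


if __name__ == "__main__":
    # Hypothetical layer: 64 -> 64 channels, 3 x 3 kernel.
    print(conv2d_params(64, 64, 3))            # standard convolution
    print(conv2d_params(64, 64, 3, groups=4))  # 4 groups: 4x fewer weights
    # The same 3-tap kernel covers wider context as dilation grows.
    print([dilated_span(3, d) for d in (1, 2, 4)])
    print(dilated_conv1d([1, 2, 3, 4, 5, 6, 7], [1, 0, -1], dilation=2))
```

Running several dilation rates in parallel over the same input lets a layer respond to lip regions at different scales without adding weights, which is one common way to obtain multi-scale features; whether the proposed frontend arranges its branches this way is not stated in the abstract.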