| Lipreading is a technology in which a computer analyzes the visual information of a speaker’s lip movements, without the help of audio information, to identify the content of speech. It converts the visual information of the speaker’s lips into text and can be widely used in speech recognition, communication for deaf and mute people, identity recognition, and other fields. Lipreading can be divided into two categories: word-level and sentence-level. Research on word-level lipreading is relatively mature. Sentence-level lipreading is the final form of lipreading and is the current focus of lipreading research. Therefore, this thesis focuses on sentence-level lipreading methods. At this stage, sentence-level lipreading still suffers from inefficient extraction of visual features, slow training convergence, and low recognition accuracy. In response to these problems, this thesis proposes a high-performance, high-efficiency lipreading network architecture based on deep neural networks. Experiments were conducted on the public OuluVS2 and GRID lipreading datasets. The experiments show that the proposed architecture can effectively improve feature extraction efficiency, accelerate model convergence, and ultimately improve recognition accuracy. The research content of this thesis is as follows: (1) Two core architectures of lipreading methods, the Encoder-Decoder framework and the CNN-RNN framework, are introduced. Building on the lipreading frameworks available at this stage, the Encoder-Decoder framework and the CNN-RNN framework are compared on the public lipreading dataset. (2) A lipreading architecture based on an improved feature extraction network is designed. It is mainly composed of two modules: front-end feature extraction and back-end sequence feature fusion. The front-end combines a 3D convolutional neural network with a MouthNet densely connected neural network to extract features, and the back-end uses a bidirectional LSTM network based on the RNN architecture for sequence feature fusion. Experiments were performed on the OuluVS2 and GRID datasets, using the Connectionist Temporal Classification (CTC) function as the objective loss function. The results show that the architecture improves both feature extraction efficiency and recognition accuracy. (3) A lipreading architecture based on an improved feature fusion network is designed. The front-end uses a 3D convolutional neural network and a ResNet50 residual convolutional neural network as the feature extractor, and the back-end uses a temporal convolutional network (TCN) for feature fusion and feature classification. Experiments on the GRID dataset also use the CTC function as the objective loss function. The results show that the proposed architecture is superior to the traditional RNN architecture in terms of recognition accuracy and convergence speed. |
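
Both proposed architectures are trained with the CTC objective, which sums the probability of every frame-level alignment that collapses to the target label sequence. The following is a minimal pure-Python sketch of that computation using the standard forward recursion; it is an illustration only and not the thesis's implementation. The function names `ctc_probability` and `ctc_loss`, the choice of index 0 for the blank symbol, and the toy probabilities are all assumptions made for this example.

```python
import math


def ctc_probability(probs, target, blank=0):
    """Total probability of `target` given per-frame class probabilities.

    probs:  list of T frames, each a list of class probabilities.
    target: list of label indices without blanks.
    blank:  index of the blank symbol (assumed to be 0 here).
    """
    # Extended target: blank, l1, blank, l2, ..., blank
    ext = [blank]
    for label in target:
        ext += [label, blank]
    size = len(ext)

    # alpha[s]: probability of all alignment prefixes that end at ext[s]
    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new_alpha = [0.0] * size
        for s in range(size):
            total = alpha[s]                      # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]             # advance by one position
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * probs[t][ext[s]]
        alpha = new_alpha

    # A valid alignment ends on the last label or on the trailing blank.
    result = alpha[size - 1]
    if size > 1:
        result += alpha[size - 2]
    return result


def ctc_loss(probs, target, blank=0):
    """Negative log-likelihood of `target`; undefined if probability is 0."""
    return -math.log(ctc_probability(probs, target, blank))


# Toy input: two frames, two classes (0 = blank, 1 = a single label).
uniform = [[0.5, 0.5], [0.5, 0.5]]
loss = ctc_loss(uniform, [1])
```

For the toy input, three alignments collapse to the target `[1]` (label-label, blank-label, label-blank), so the summed probability is 0.75. In practice a framework implementation working in log space would be used to avoid numerical underflow on long sequences.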