| Text in scene images carries rich high-level semantic information, which provides important assistance for computers in understanding those images. Text detection and recognition in scene images have therefore attracted increasing interest from researchers. This thesis studies both scene text detection and scene text recognition and proposes a network model for each; furthermore, it presents a text recognition model based on knowledge distillation to reduce the parameters and computation of the network model. First, in the text detection task, text in an image is regarded as a special kind of object, and text detection built on existing general object detection algorithms is one of the current mainstream approaches. However, current detection algorithms have weak angle regression capabilities, which makes it difficult to estimate text orientation accurately when multi-oriented scene text is present. This thesis proposes a deep convolutional network model that automatically detects oriented text. The model introduces a spatial transform module designed to learn text orientation, and further proposes feature alignment to learn and refine the oriented bounding boxes of text. The method avoids presetting a large number of anchor boxes with different angles and aspect ratios, which greatly reduces the complexity of the model. Experimental results show that the proposed detection model effectively improves on the baseline model, achieving state-of-the-art performance on both the ICDAR-2015 and MSRA-TD500 datasets. Second, in the text recognition task, long short-term memory (LSTM) networks are often used for sequence modeling of the features extracted by convolutional neural networks (CNNs). However, an LSTM models sequence features serially, which causes some information loss when modeling long-range features. To address this, this thesis introduces the attention mechanism into the sequence modeling of text features; long-range dependencies are modeled by learning the similarity between different texts, and performance is further boosted by stacking multiple attention modules. Experiments show that the proposed method is more effective than LSTM, significantly improving recognition accuracy and remaining highly competitive with state-of-the-art methods. Finally, this thesis proposes a scene text recognition network based on knowledge distillation, which transfers the knowledge of a large, complex teacher network to a lightweight student model with a small number of parameters. For the feature extraction part, intermediate-layer feature distillation is used to supervise the training of the student model; for the sequence modeling part, the similarity matrix and the output of the feedforward network serve as supervision signals for training the student model. Experiments show that the proposed method significantly improves the recognition accuracy of the lightweight student model, achieving performance comparable to or better than that of the teacher model. |
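
The distillation of the sequence modeling part can be illustrated with a minimal sketch. It assumes that the similarity matrix is a row-wise softmax over scaled dot products between time steps, that both supervision signals are matched with a mean squared error, and that the teacher and student feedforward outputs share the same shape. The function names (`similarity_matrix`, `mse`, `distillation_loss`) and the loss weights `w_sim` and `w_out` are hypothetical and are not taken from the thesis.

```python
import math


def similarity_matrix(features):
    """Row-wise softmax of scaled dot products between all time steps.

    `features` is a list of T feature vectors; the result is T x T,
    independent of the feature dimension.
    """
    dim = len(features[0])
    matrix = []
    for query in features:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in features
        ]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        matrix.append([e / total for e in exps])
    return matrix


def mse(a, b):
    """Mean squared error between two equally shaped 2-D lists."""
    diffs = [
        (x - y) ** 2
        for row_a, row_b in zip(a, b)
        for x, y in zip(row_a, row_b)
    ]
    return sum(diffs) / len(diffs)


def distillation_loss(teacher_feats, student_feats,
                      teacher_out, student_out,
                      w_sim=1.0, w_out=1.0):
    """Weighted sum of similarity-matrix and feedforward-output losses.

    The weights are illustrative assumptions, not values from the thesis.
    """
    sim_loss = mse(similarity_matrix(teacher_feats),
                   similarity_matrix(student_feats))
    out_loss = mse(teacher_out, student_out)
    return w_sim * sim_loss + w_out * out_loss


# Toy example: 3 time steps, teacher features of width 4, student of width 2.
teacher_feats = [[0.2, 1.0, -0.5, 0.3], [1.1, 0.0, 0.4, -0.2], [0.0, 0.7, 0.9, 0.1]]
student_feats = [[0.1, 0.8], [0.9, -0.1], [0.2, 0.6]]
teacher_out = [[0.5, 0.1], [0.3, 0.9], [0.0, 0.4]]
student_out = [[0.4, 0.2], [0.2, 0.7], [0.1, 0.5]]
print(distillation_loss(teacher_feats, student_feats, teacher_out, student_out))
```

Because the similarity matrix has one row and one column per time step, the teacher and the student can be compared even when their feature widths differ, which is one plausible reason for using it as a supervision signal for a narrower student.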