| Text in natural scene images or videos usually carries important semantic information. Scene text recognition aims to automatically recognize and extract text from real natural scene images, and is widely used in traffic monitoring, multimedia retrieval, and semantic understanding of natural scenes. It is a practical and challenging task. Compared with text recognition in documents, natural scene text recognition faces additional challenges, including the complexity of scenes, the diversity of texts, and more stringent practical requirements. First, because of design choices and shooting angles, natural scenes contain a large amount of irregularly shaped text, yet existing convolutional neural network-based scene text recognition algorithms have difficulty capturing the semantic information of text in images. Second, images may be blurred or low-resolution because of the hardware limitations of the shooting equipment, and the lack of sufficient detail in such images easily leads to erroneous results. Traditional super-resolution methods mainly focus on reconstructing the detailed textures of raw images, but they usually do not work well on text images because of the unique characteristics of text. Finally, because the requirements placed on text recognition models differ significantly across scenarios in daily life and industrial production, building prototype systems for model inference that cover the needs of multiple scenarios is also an important research task. This thesis focuses on the task of scene text recognition, aiming to solve the text recognition problem in complex scenes and to realize a prototype system with cloud-edge collaboration. The research content of this thesis is as follows. 1. To address the difficulty that existing convolutional neural network-based methods have in modeling horizontal and vertical text semantic information in images, this thesis proposes a new end-to-end scene text recognition network, in which the HVBiLSTM module (Horizontal and Vertical Bi-directional Long Short-Term Memory network) is used to extract image features and model the association between text characters. The HVBiLSTM module serializes the input image block embeddings horizontally and vertically, models the dependencies along each sequence direction with a bidirectional LSTM, and finally concatenates the two features and performs channel fusion. This dependency modeling approach fits the characteristics of text recognition tasks and can effectively capture the semantic information of text in images. Comparative experiments demonstrate the effectiveness of the HVBiLSTM module, and comparison with other classical methods shows that the improved model has an advantage in accuracy. 2. This thesis proposes a lightweight image super-resolution model, DCSR, which can be used in the pre-processing stage of scene text recognition to reconstruct the image and thus reduce the difficulty of subsequent text recognition. To reduce the extra computational cost, the DCSR model uses sparse mask generation and dynamic convolution to skip the computation of partially redundant regions. The experimental results show that the proposed DCSR model effectively improves the accuracy of the scene text recognizer while remaining computationally efficient. 3. A prototype system supporting cloud-edge collaboration is designed to broaden its application scenarios in several ways: collaborative inference between the cloud and the edge, functional modularity to support flexible deployment, and inference capability. A scene text recognition system is then implemented on this prototype system with the two proposed deep learning models. In summary, this thesis conducts research based on deep learning and proposes SeqSTR, an end-to-end scene text recognition model; DCSR, a lightweight text super-resolution model; and a prototype system for model inference with cloud-edge collaboration. The methods in this thesis improve the accuracy of scene text recognition and help to promote the application of scene text recognition technology in realistic scenarios. |
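
The data flow described for the HVBiLSTM module (serialize block embeddings horizontally and vertically, run a bidirectional LSTM along each direction, concatenate the two results, fuse channels) can be sketched in PyTorch as below. This is only an illustration under stated assumptions, not the thesis's implementation: the class name `HVBiLSTMSketch`, the `(batch, height, width, channels)` input layout, the hidden size, and the choice of a linear layer for channel fusion are all hypothetical.

```python
import torch
import torch.nn as nn


class HVBiLSTMSketch(nn.Module):
    """Hypothetical sketch of a horizontal + vertical bidirectional LSTM block."""

    def __init__(self, channels: int, hidden: int):
        super().__init__()
        # One bidirectional LSTM per scan direction (assumed: separate weights).
        self.h_lstm = nn.LSTM(channels, hidden, batch_first=True, bidirectional=True)
        self.v_lstm = nn.LSTM(channels, hidden, batch_first=True, bidirectional=True)
        # Channel fusion (assumed here to be a linear projection back to `channels`).
        self.fuse = nn.Linear(4 * hidden, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: grid of image block embeddings, shape (B, H, W, C).
        b, h, w, c = x.shape

        # Horizontal serialization: every row becomes a sequence of length W.
        rows = x.reshape(b * h, w, c)
        h_out, _ = self.h_lstm(rows)                    # (B*H, W, 2*hidden)
        h_out = h_out.reshape(b, h, w, -1)

        # Vertical serialization: every column becomes a sequence of length H.
        cols = x.permute(0, 2, 1, 3).reshape(b * w, h, c)
        v_out, _ = self.v_lstm(cols)                    # (B*W, H, 2*hidden)
        v_out = v_out.reshape(b, w, h, -1).permute(0, 2, 1, 3)

        # Concatenate both directional features, then fuse channels.
        return self.fuse(torch.cat([h_out, v_out], dim=-1))  # (B, H, W, C)
```

Because the output shape matches the input shape in this sketch, such a block could be stacked or combined with a residual connection; whether the thesis does so is not stated in the abstract.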