Text has been a crucial tool in preserving and advancing human civilization, and it plays a vital role in conveying information across many domains of society. For instance, text on cards, door signs, road signs, and tickets describes essential content in different situations, helping people understand and carry out their tasks. With the rapid advancement of information technology, text is now widely present in images and videos, and text recognition technology, which automatically converts text in images into computer-processable character sequences, has become indispensable for improving productivity. As a fundamental technology in computer vision, text recognition is applied in many practical scenarios, such as document digitization, robot navigation, intelligent logistics, autonomous driving, and digital government.

The rapid development of deep learning has produced many deep neural network-based methods that solve text recognition effectively. However, these methods are typically data-driven and require large amounts of annotated data to reach their full performance. Fortunately, self-supervised contrastive learning can alleviate this reliance on labeled data by learning useful representations from large amounts of inexpensive unlabeled data. We present a series of studies on text recognition based on self-supervised contrastive learning, which show promising improvements in accuracy.

First, we propose a self-supervised method based on grouping and differentiation to address the problem that similar characters are easily confused in the representation space. Our method takes both intra-group and inter-group contrast as optimization objectives: the intra-group objective requires the model to distinguish the input from hard similar samples within close neighboring clusters, while the inter-group objective separates different semantic groups by enlarging the distances among all cluster centroids. Experiments on Chinese and English benchmark datasets demonstrate the effectiveness of the proposed method.
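To make the two objectives concrete, the following is a minimal PyTorch-style sketch of one way they could be combined. The function name, batch layout, temperature, and the simplification of restricting intra-group negatives to the sample's own cluster (rather than a set of close neighboring clusters) are illustrative assumptions on our part, not details taken from the method itself.

```python
import torch
import torch.nn.functional as F

def grouping_loss(z1, z2, assign, centroids, tau=0.1, lam=1.0):
    """Illustrative grouping-based contrastive objective (all names assumed).

    z1, z2:    (N, D) L2-normalized features of two views of N samples
    assign:    (N,)   cluster index of each sample (e.g., from k-means)
    centroids: (K, D) L2-normalized cluster centroids
    """
    N = z1.size(0)

    # Intra-group: instance discrimination where the candidate set is
    # restricted to samples sharing the input's cluster, i.e., the hard
    # look-alike characters; the positive is the other view of the input.
    sim = z1 @ z2.t() / tau                                   # (N, N)
    same_group = assign.unsqueeze(0) == assign.unsqueeze(1)   # (N, N)
    logits = sim.masked_fill(~same_group, float('-inf'))
    intra = F.cross_entropy(logits, torch.arange(N, device=z1.device))

    # Inter-group: enlarge the separation among cluster centroids by
    # penalizing their pairwise similarities (a uniformity-style term).
    K = centroids.size(0)
    c_sim = centroids @ centroids.t() / tau
    off_diag = c_sim[~torch.eye(K, dtype=torch.bool, device=c_sim.device)]
    inter = torch.logsumexp(off_diag.view(K, K - 1), dim=1).mean()

    return intra + lam * inter
```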
Second, we propose a self-supervised method based on contrastive predictive coding to capture the visual sequence information implicit in text-line images. The method requires the model to use contextual features to predict future time steps in the latent space, which drives it to learn the global structure of a line. To address the overlap of information within the feature map induced by the deep convolutional encoder, we design a widthwise causal convolution for pre-training and a progressive recovery training strategy for fine-tuning. Experimental results show that the proposed self-supervised method effectively enhances the performance of text recognition models in most cases.

Third, we propose a self-supervised method that considers both local appearance and global sequence, integrating information at the two levels. The method uses grouping-based contrastive learning to obtain local appearance information and contrastive predictive coding to learn global sequence information. It consists of three branches, each of which extracts high-dimensional features for a different view and performs both the grouping-based contrastive learning and contrastive predictive coding tasks in a multi-task learning fashion. By fully integrating information from the global and local levels, the method achieves better performance than methods using a single level of information. Experimental results on several Chinese and English benchmark datasets demonstrate its effectiveness.
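As an illustration of the predictive-coding objective shared by the second and third methods, here is a minimal PyTorch-style sketch: a causal convolution along the width axis summarizes past positions, linear heads predict the latents of future positions, and an InfoNCE loss scores each prediction against the other positions of the same line. The class name, feature dimension, number of predicted steps, and the use of in-image negatives are our own assumptions; the abstract names the widthwise causal convolution and the progressive recovery strategy but does not specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WidthwiseCPC(nn.Module):
    """Illustrative CPC head over a width-wise latent sequence (names assumed)."""

    def __init__(self, dim=256, steps=4, kernel=3):
        super().__init__()
        self.pad = kernel - 1                 # left-only padding => causal
        self.context = nn.Conv1d(dim, dim, kernel)
        self.heads = nn.ModuleList([nn.Linear(dim, dim) for _ in range(steps)])
        self.steps = steps

    def forward(self, z):
        # z: (B, T, D) latents from the CNN encoder, width treated as time
        c = self.context(F.pad(z.transpose(1, 2), (self.pad, 0)))
        c = c.transpose(1, 2)                 # (B, T, D) causal context
        B, T, _ = z.shape
        loss = 0.0
        for k, head in enumerate(self.heads, start=1):
            pred = head(c[:, : T - k])        # predict z_{t+k} from c_t
            target = z[:, k:]                 # (B, T-k, D)
            # InfoNCE: the true future position is the positive; every
            # other position of the same line serves as a negative
            logits = torch.einsum('btd,bsd->bts', pred, target)
            labels = torch.arange(T - k, device=z.device).expand(B, -1)
            loss = loss + F.cross_entropy(
                logits.reshape(-1, T - k), labels.reshape(-1))
        return loss / self.steps
```

In this sketch the left-padded convolution plays the role of the widthwise causal convolution described above: position t aggregates only positions to its left, so predicting z_{t+k} cannot be trivially solved through the receptive-field overlap a standard convolution would introduce.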