Text is the cornerstone of human information transmission, and its invention marks the starting point of human civilization. In the information era, a huge amount of data is exchanged in daily communication, and text appears around us in various forms. The saying that painting and calligraphy share the same origin reflects the inseparable relationship between text and images. With the development of computer technology, scene text image analysis supports information extraction, scene analysis and other applications. However, due to the diversity of text geometry and the complexity of image backgrounds, conventional image analysis techniques struggle to capture text regions accurately, which makes accurate understanding and processing of text content difficult. Scene text image analysis has therefore gradually become a research hotspot.

Scene text image analysis focuses on text instances, aiming to extract text content and verify its authenticity. It comprises scene text image localization, recognition and tampering detection. The performance of current algorithms for these three tasks is limited by, respectively, inconsistent activation of multi-scale text features and difficulties in network optimization; inaccurate modeling and insufficient fusion of language information; and difficulty in learning and accurately identifying tampered texture features. This thesis therefore studies these three key technologies, and its main work is as follows:

1. Research on scene text image localization based on multi-scale feature learning

The large variance in text scale leads to inaccurate multi-scale text localization. Because application scenarios differ, multi-scale text localization algorithms must be designed to be either efficient or accurate, depending on the scenario. On the one hand, current text localization algorithms based on a multi-stage prediction framework produce a large number of redundant predictions, which limits their computational efficiency in high-response scenarios. On the other hand, for text localization in high-precision scenarios, the large changes in the smooth L1 value caused by text scale variance hinder the convergence of the region proposal network. Building on the feature pyramid network, this thesis first studies a multi-scale feature propagation network and explores an efficient multi-scale text localization algorithm for high-response scenarios that activates multi-scale text features consistently. It then constructs a multi-scale region proposal network based on the feature pyramid network and explores an accurate multi-scale text localization algorithm for high-precision scenarios, which introduces a scale-invariant IoU optimization strategy and adaptive text shape perception.

2. Research on scene text image recognition based on visual language modeling

Insufficient visual information about characters makes occluded text images difficult to recognize. Because occluded scene text images are scarce and labeling them manually is costly, it is difficult for networks to learn accurate language modeling under occlusion from limited samples. Furthermore, even with linguistic information learned from occluded samples, insufficient aggregation of visual and linguistic information limits the performance of occluded text image recognition. This thesis first explores an algorithm for automatically generating character-level occlusion masks based on weakly-supervised complementary learning (WCL). WCL guides the learning of character-level mask maps under word-level supervision only, and helps the network model language information accurately under occlusion. Secondly, using the character-level occlusion masks, this thesis constructs a visual language modeling algorithm based on the masked language model (MLM). MLM introduces a training process of occluded text feature inference, guiding the vision model to actively capture the contextual language information of characters in the visual space. At test time, the recognition network achieves accurate recognition of occluded text images by adaptively fusing visual and linguistic information.

3. Research on scene text image tampering detection based on local texture difference modeling

The low discriminability between tampered and real-world text leads to inaccurate text image tampering detection. This thesis defines a new tampered scene text detection task, which localizes all text instances in an image and identifies the authenticity of each one. Because high-quality tampered text images are lacking, it is difficult for a tampering detection network to learn robust features of tampered text from a limited number of samples. In addition, it is difficult for networks to perceive the fine-grained local texture differences between tampered and real-world text. Firstly, this thesis studies a tampered text image generation algorithm based on a progressive region-based text eraser, which generates high-quality tampered text images and provides a foundation for subsequent research on the tampering detection task. Furthermore, based on the resulting high-quality tampered text image dataset, this thesis proposes a Separating Segmentation-Sharing Regression (S3R) network modification strategy that helps a general localization network transfer to a tampering detector, and constructs a general tampered scene text detection pipeline. In addition, a parallel feature extractor is proposed to model the local texture features of tampered and real-world text in the frequency domain, maximizing the local texture differences between the two categories of text.

Finally, this thesis demonstrates the effectiveness of the proposed methods on scene text image analysis tasks. It conducts systematic and targeted research on the key technologies of scene text image analysis, and achieves certain improvements in practical applications.
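
The contrast drawn in part 1 between the smooth L1 value and a scale-invariant IoU objective can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes axis-aligned boxes in `(x1, y1, x2, y2)` form, applies smooth L1 to raw pixel coordinates (the abstract does not state the regression parameterization), and uses hypothetical helper names (`smooth_l1`, `iou`, `scale_box`) and example boxes. The same relative localization error is evaluated at three text scales.

```python
def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 summed over the four box coordinates (raw pixels, an assumption)."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        # Quadratic near zero, linear beyond beta
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total


def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def scale_box(box, s):
    """Scale every coordinate by s, i.e. the same text seen at a larger size."""
    return tuple(s * v for v in box)


# Hypothetical ground-truth and predicted boxes for one text instance
gt = (0.0, 0.0, 100.0, 20.0)
pred = (5.0, 2.0, 110.0, 24.0)

for s in (1, 4, 16):
    g, p = scale_box(gt, s), scale_box(pred, s)
    print(f"scale x{s:<2}  smooth L1 = {smooth_l1(p, g):8.1f}  IoU loss = {1.0 - iou(p, g):.4f}")
```

Under these assumptions the smooth L1 value grows roughly in proportion to the text size while the IoU loss stays the same, which is the property a scale-invariant IoU objective relies on.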