| As the most meaningful region in a scene image, text plays an important role in understanding and analyzing the scene. For example, trademark text and road-sign text often carry the key content of the current view. Scene text detection, which aims to locate text regions in natural scenes, has therefore attracted increasing attention because of its wide range of applications. The development of deep learning in recent years has produced many high-quality solutions for scene text detection, among which instance segmentation-based methods are widely adopted because of their top-down, fine-grained detection process. However, because text morphology varies widely and background interference is strong in natural scenes, instance segmentation-based methods still have open problems. This paper focuses on false positives (FPs) in instance segmentation-based methods and divides the problem into two sub-problems: (1) FPs within text regions. Instance segmentation-based methods predict text proposals in the first stage and classify the pixels inside each proposal box one by one in the second stage. However, low-dimensional spatial features are poorly distinguishable, so background pixels are easily misclassified as foreground pixels, causing false positives within text regions that manifest as inaccurate text boundary detection. (2) Text instance FPs. Natural scenes contain complex background noise, and some background spatial features are highly similar to those of text regions. Because context information is lacking during network modeling, such background features are easily mis-activated in the network, producing text instance FPs, in which pure background regions are misjudged as text. To address these problems, this paper conducts research at three levels: pixel level, instance level, and region level. The specific research
contents are as follows: (1) Pixel-level semantic association modeling. To address FPs within text regions, this paper imposes additional constraints on pixel-level category features in a high-dimensional embedding space and designs a corresponding pixel embedding branch. Using a metric learning loss, the pixel embedding branch pulls together the representation vectors of boundary-class pixels while pushing apart the representation vectors of boundary-class and non-boundary-class pixels, achieving class separation in the high-dimensional embedding space. In addition, for the inference stage, this paper proposes a noisy point suppression algorithm that performs a second round of filtering on candidate boundary points based on feature metrics in the embedding space. (2) Instance-level semantic association modeling. Further analysis shows that pixel-level local modeling is limited by the representational power of local features, so overall network optimization is not sufficiently stable. Considering that different text instances in the same image often share semantic properties such as color, background, and font, this paper further approaches the problem through instance-level semantic association modeling. By modeling the visual feature association between any pair of text instances, it strengthens the discriminative features of text and keeps network training stable. (3) Region-level semantic association modeling. To address text instance FPs caused by the lack of context information during network modeling, this paper proposes a context enhancement module based on region-level semantic association modeling. This module enhances the contextual information of the initial features in two ways: spatial feature enhancement (SFEM) and channel information fusion (CIFM). SFEM captures global dependencies in multiple feature subspaces and weights the initial
features. CIFM, in turn, selects channel features by adaptively fusing the channel descriptors obtained from global max pooling and global average pooling. Based on the above research, this paper validates the proposed scheme on three general and challenging scene text datasets, and further conducts ablation experiments and comparisons with other models. The experimental results show that the proposed method effectively reduces inaccurate text boundary detection and further suppresses false positives in the detection results, and that it shows clear advantages over existing methods. |
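
To make the pixel-level idea more concrete, the following is a minimal sketch of a pull/push metric loss over pixel embeddings, together with a distance-based filter in the spirit of the noisy point suppression step. The abstract does not specify the loss form, so the hinged formulation, the margins `pull_margin` and `push_margin`, the `threshold`, and all function names here are illustrative assumptions rather than the method used in the paper.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def embedding_loss(boundary, non_boundary, pull_margin=0.5, push_margin=3.0):
    """Hinged pull/push loss (assumed form, margins are hypothetical).

    Pull term: boundary embeddings farther than pull_margin from the
    boundary-class center are penalized.
    Push term: non-boundary embeddings closer than push_margin to that
    center are penalized.
    """
    center = mean_vector(boundary)
    pull = sum(
        max(0.0, distance(v, center) - pull_margin) ** 2 for v in boundary
    ) / len(boundary)
    push = sum(
        max(0.0, push_margin - distance(v, center)) ** 2 for v in non_boundary
    ) / len(non_boundary)
    return pull + push


def suppress_noisy_points(candidates, center, threshold):
    """Return indices of candidate boundary points whose embedding lies
    within `threshold` of the boundary-class center (assumed criterion)."""
    return [i for i, v in enumerate(candidates) if distance(v, center) <= threshold]
```

With well-separated embeddings both terms vanish, whereas a non-boundary embedding that sits near the boundary-class center raises the push term. At inference, `suppress_noisy_points` discards candidates whose embeddings are far from the center, which mirrors the secondary filtering described above only at a conceptual level.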
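
The channel fusion step can be sketched in a similar way. The example below computes global max-pooling and average-pooling descriptors per channel, blends them, and uses the result to gate each channel. The scalar weight `alpha` stands in for whatever adaptive (learned) fusion the paper uses, and the sigmoid gate and function names are likewise assumptions made for illustration.

```python
import math


def channel_descriptors(feature_map):
    """Global max and average pooling per channel.

    feature_map: list of channels, each a 2-D list of floats (H x W).
    Returns (max_descriptor, avg_descriptor), one value per channel.
    """
    max_desc, avg_desc = [], []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        max_desc.append(max(values))
        avg_desc.append(sum(values) / len(values))
    return max_desc, avg_desc


def fuse_channels(feature_map, alpha=0.5):
    """Blend the two descriptors and gate each channel.

    alpha is a hypothetical fusion weight; in a trained network it would
    be learned rather than fixed.
    """
    max_desc, avg_desc = channel_descriptors(feature_map)
    gates = []
    for m, a in zip(max_desc, avg_desc):
        fused = alpha * m + (1.0 - alpha) * a
        gates.append(1.0 / (1.0 + math.exp(-fused)))  # sigmoid gate in (0, 1)
    reweighted = [
        [[g * v for v in row] for row in channel]
        for channel, g in zip(feature_map, gates)
    ]
    return gates, reweighted
```

Channels with strong responses receive gates close to 1 and are largely preserved, while weakly activated channels are attenuated, which is one simple way to realize the channel selection the abstract describes.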