
Research And Application Of Text Detection In Natural Scene Images Based On Deep Learning

Posted on: 2021-01-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z Y Zhong
GTID: 1368330611967105
Subject: Information and Communication Engineering
Abstract/Summary:
Text in natural scene images carries rich and precise semantic information. It is an important visual element for image understanding and benefits a variety of applications, e.g., information retrieval, real-time translation, autonomous driving, automatic reading, and robotic process automation (RPA). Consequently, scene text detection has recently drawn considerable attention from the computer vision and the document analysis and recognition communities. However, scene text varies widely in scale, shape, orientation, language, color, font, layout and alignment; backgrounds can be extremely complex and text-like; and image capture introduces artifacts such as distortion, blur, non-uniform illumination, strong exposure and occlusion. For these reasons, text detection in natural scene images remains an extremely challenging and unsolved problem. Although traditional sliding-window based and connected-component based approaches have achieved some promising results, their performance is not satisfactory in real-world scenarios. Moreover, these methods generally have complicated detection pipelines built from several sequential core modules, which leads to error accumulation. Recently, deep learning has developed rapidly. Owing to its powerful feature learning ability and its end-to-end trainability, it has produced breakthroughs in computer vision, speech, natural language processing and other fields. Inspired by these advances, this thesis presents a comprehensive study of scene text detection based on deep learning, which can be summarized as follows.

(1) We present an end-to-end trainable scene text detection approach. Previous scene text detection approaches (before 2016) generally have complicated detection pipelines built from several sequential core modules, which cannot be trained in an end-to-end manner. To address this problem, and inspired by the Faster R-CNN framework, we propose an end-to-end trainable approach for text detection in natural scene images. First, we propose a novel inception region proposal network (Inception-RPN), which slides an inception network with multi-scale windows over the top convolutional feature maps and associates a set of text-characteristic anchor boxes with each sliding position to generate high-recall word region proposals. Next, we present a text refinement network that embeds ambiguous text category information and multi-level region-of-interest pooling for text/non-text classification and bounding box localization refinement. The two networks share the convolutional features of the backbone network and can be trained in an end-to-end manner. Experiments demonstrate that our approach achieves superior performance on the ICDAR-2011 and ICDAR-2013 scene text detection benchmark tasks, outperforming previous state-of-the-art results substantially.

(2) We propose a scene text detection approach with high text localization accuracy. Compared with other object detection tasks, scene text detection places a stricter requirement on accurate bounding box prediction, which poses an additional challenge. Poor text localization accuracy not only degrades the performance of the text detection task but also hurts the subsequent text recognition task. In this thesis we study this problem and find that the text localization accuracy of the original bounding box regression module is unsatisfactory in certain cases. Instead of formulating bounding box localization as a direct regression problem, we formulate it as a relatively easier dense binary classification problem and replace the bounding box regression module with a LocNet based localization module. Experiments demonstrate that the LocNet based localization module boosts text localization accuracy significantly. Moreover, we carefully consider the problems of small text detection and text-like backgrounds, and propose to address them with skip pooling and online hard example mining, respectively. Furthermore, we present a simple yet effective two-stage approach that converts the difficult multi-oriented text detection problem into a relatively easier horizontal text detection problem, which enables our approach to robustly detect multi-oriented text instances with accurate bounding box localization. Owing to these improvements, our proposed text detector achieves superior performance on both horizontal and multi-oriented scene text detection benchmark tasks.

(3) We propose a novel anchor-free region proposal network (AF-RPN) for multi-oriented and arbitrary-shaped scene text detection. The anchor mechanism plays an important role in current state-of-the-art top-down scene text detection approaches based on deep learning. These anchor-based methods require manually designing a complicated set of anchors of different scales, aspect ratios and orientations to predict text region proposals or the target text instances, which makes them complicated and, to some extent, inefficient. To address this problem, we propose AF-RPN, which generates high-quality text proposals in an anchor-free manner by directly predicting the offsets from a given sliding point located in a text core region to the bounding box vertices of the corresponding text instance. Moreover, we propose a scale-friendly learning method that lets AF-RPN extract text proposals from multi-scale feature maps, so that it is more robust to large variance in text scale. Compared with an anchor-based RPN, the proposed AF-RPN is more flexible and straightforward, and achieves a higher recall rate in the text proposal generation task. Furthermore, we incorporate AF-RPN into the Faster R-CNN/Mask R-CNN framework by replacing the original anchor-based RPN. Extensive experiments demonstrate that the resulting Faster R-CNN/Mask R-CNN based two-stage text detection approach achieves superior performance on horizontal, multi-oriented and arbitrary-shaped scene text detection benchmark tasks. The anchor-free idea has since also become popular in general object detection, which further supports the generality of our approach.
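
To make the anchor-free idea in contribution (3) concrete, the following is a minimal sketch of how offsets predicted at one sliding point could be decoded into the four vertices of a text quadrilateral. It is an illustration only, not the thesis's implementation: the function name `decode_quad`, the `stride` parameter, the offset layout (dx, dy per vertex) and the clockwise vertex order are all assumptions made for this example.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def decode_quad(point: Point, offsets: List[float], stride: float = 1.0) -> List[Point]:
    """Decode 8 offsets (dx1, dy1, ..., dx4, dy4) into 4 quadrilateral vertices.

    `point` is a sliding point assumed to lie inside a text core region.
    `stride` is a hypothetical scale factor (e.g. the feature-map stride)
    applied to the raw offsets; the thesis abstract does not specify one.
    """
    if len(offsets) != 8:
        raise ValueError("expected 8 offsets (dx, dy for each of 4 vertices)")
    px, py = point
    # Each vertex is the sliding point plus its scaled (dx, dy) offset.
    return [
        (px + stride * offsets[i], py + stride * offsets[i + 1])
        for i in range(0, 8, 2)
    ]


# A sliding point at (50, 20) with offsets pointing to the four corners
# of a horizontal text box (assumed order: top-left, clockwise).
quad = decode_quad(
    (50.0, 20.0),
    [-40.0, -10.0, 40.0, -10.0, 40.0, 10.0, -40.0, 10.0],
)
print(quad)
```

Because every vertex is predicted relative to the point itself, no predefined set of anchor scales, aspect ratios or orientations is needed; a rotated or skewed quadrilateral is expressed simply by different offsets.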
Keywords/Search Tags: Scene text detection, End-to-end trainable, Text localization accuracy, Anchor-free, Region proposal network, Deep learning, Convolutional neural network