Research On Text And Specific Object Detection Algorithm In Images And Videos

Posted on: 2022-04-14
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Q Cai
Full Text: PDF
GTID: 1488306509497694
Subject: Computer application technology
Abstract/Summary:
With the development of science and technology, intelligent image-capture devices are becoming increasingly popular, and the corresponding shooting methods and content forms are becoming increasingly diverse. Most of the information recorded by these devices takes the form of images and videos. Text and core objects in images and videos help with effective retrieval and understanding of the relevant information, so they need to be located accurately. This paper summarizes the research status of text and specific object detection in images and videos and, on this basis, uses deep learning methods to study several remaining difficulties in depth. The main contributions of this paper are as follows:

1. To address the problems of varying text sizes, diverse aspect ratios, and the difficult discrimination of adjacent texts, this paper presents a novel inside-to-outside supervision network (IOS-Net) that tackles these problems together. Specifically, this paper designs a hierarchical supervision module (HSM), which consists of a new inception unit with parallel asymmetric convolutions and a skip-layer fusion structure. Inside the HSM, hierarchical supervision is introduced into the new inception unit to effectively capture texts with diverse aspect ratios. Outside the HSM, multiple-scale supervision is applied to the stacked HSMs to accurately detect texts of various sizes. Moreover, a position-sensitive segmentation map is used to enhance the representation of difficult text objects and the discrimination of adjacent ones.

2. To address the limitations of existing multi-scale learning schemes and the lack of scale-related datasets, this paper proposes a scene text detection method based on a Scale-residual Learning Network (SLN), which deals with the scale variation problem in a progressive optimization manner. Specifically, SLN integrates a learnable feature concatenation operator and a feature up-sampling operator. It effectively eliminates the residuals between the outputs of SLN and the corresponding ground-truth text instances by processing the Feature Fusion Residuals (FFR) and the Scale Transformation Residuals (STR) simultaneously. By stacking multi-scale feature maps in a deep-to-shallow manner, SLN continuously optimizes the feature representation, accumulating strong semantic information and rich texture details in a scale-residual learning way. In addition, this paper establishes a large-scale scene text detection dataset (LS-Text), containing 36,000 images and 270,783 text instances, to promote research on text detection.

3. To address the problem that text is presented in different ways in videos, this paper presents a video text detector based on a temporally consistent representation network, which can accurately localize all types of text in videos at the same time. The method consists of a spatial text detector and a temporal fusion filter. First, three different strategies are explored for learning the spatial text detector based on deep convolutional neural networks, so that it can detect various texts simultaneously without knowing the text type. Then, a new area-first non-maximum suppression computation combined with multiple constraints is proposed to remove redundant bounding boxes. Finally, the temporal fusion filter exploits the features of spatial locations and text components to integrate the detection results of consecutive frames and further remove false positives.

4. To address the problem of spatial-temporal detection of video text, this paper presents a spatiotemporal text localization method based on a sampling and divide-and-conquer network. Concretely, a unified framework is proposed that consists of the sampling-and-recovery model (SaRM) and the divide-and-conquer model (DaCM). SaRM exploits the temporal redundancy of text to increase detection efficiency for videos. DaCM is designed to efficiently localize text in the spatial and temporal domains simultaneously. In addition, this paper constructs a challenging video overlaid text dataset named UCAS-STLData, which contains 57,070 frames with the corresponding spatiotemporal ground truths.

5. To address the problems of small objects and cluttered backgrounds in drone scenes, this paper proposes a drone scene object detection and counting method based on a Guided Attention network (GAnet). Unlike previous methods that rely on unsupervised attention modules, this paper fuses feature maps of different scales using the proposed weakly-supervised Background Attention (BA) between the background and objects to obtain a more semantic feature representation. Then, the Foreground Attention (FA) module is developed to consider both the global and local appearance of the object and thus facilitate accurate localization. Moreover, a new data augmentation strategy is designed to train a robust model for drone scenes with various illumination conditions.

6. To address the problem of serious occlusion among goods of the same category in retail stores, this paper proposes a new object Localization and Counting task (Locount) and a retail scene object detection and counting algorithm based on a cascade localization and counting network (CLCNet). The task requires an algorithm to localize groups of objects of interest together with the number of instances in each group. At present, however, no dataset meets the needs of this task. To this end, this paper collects a large-scale object localization and counting dataset with rich annotations in retail stores. To facilitate fair comparison and evaluation of different algorithms, the dataset is divided into a training set and a test set, providing researchers with a benchmark for the object detection and counting task. In addition, this paper proposes a cascade detection and counting network as the benchmark algorithm, which can be trained end to end in a multi-task manner and can simultaneously predict the category of a commodity object, its bounding box, and the number of instances in the bounding box.
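
As an illustration of the area-first non-maximum suppression mentioned in contribution 3, the following Python sketch shows one plausible reading of the idea: candidate boxes are ranked by area rather than by confidence score, and a box is dropped when it overlaps a larger box that has already been kept. The abstract does not describe the ranking details or the "multiple constraints", so the function names, the IoU overlap criterion, and the 0.5 threshold are assumptions made for illustration and not the author's implementation.

```python
from typing import List, Tuple

# Axis-aligned box as (x1, y1, x2, y2); this layout is an assumption.
Box = Tuple[float, float, float, float]


def area(b: Box) -> float:
    """Area of a box; degenerate boxes count as zero."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def area_first_nms(boxes: List[Box], iou_threshold: float = 0.5) -> List[int]:
    """Return indices of kept boxes, visiting the largest boxes first.

    Hypothetical variant: standard NMS ranks by confidence score, while
    this sketch ranks by area. The additional constraints mentioned in
    the abstract are not modeled here.
    """
    # Stable sort, so boxes with equal area keep their input order.
    order = sorted(range(len(boxes)), key=lambda i: area(boxes[i]), reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap any kept box too much.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


candidates = [(0, 0, 10, 10), (1, 1, 9, 9), (20, 20, 30, 30)]
kept = area_first_nms(candidates)
```

In this example the second box lies inside the first with an IoU of 0.64, so it is suppressed, while the third box does not overlap anything and is kept. Ranking by area favors boxes that cover a whole text line over fragments of it, which is one reason such an ordering might suit text detection.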
Keywords/Search Tags: Image text detection, Video text detection, Spatial-temporal video text detection, Guided attention, Detection and counting