
Research On Scene Text Detection Based On Transformer

Posted on: 2022-12-26    Degree: Master    Type: Thesis
Country: China    Candidate: X Wang    Full Text: PDF
GTID: 2518306722971909    Subject: Master of Engineering
Abstract/Summary:
In recent years, scene text detection has received close attention from both academia and industry. The mainstream approaches in this field have been built on convolutional neural networks, but these have several limitations. For example, non-maximum suppression must be performed at prediction time, which can easily filter out detection boxes that lie very close to one another and hurt detection quality; the filtering threshold must be specified manually, which limits generality; and convolution kernels must be stacked to obtain a sufficiently large receptive field, which complicates the network structure. Over the past year, Transformer-based methods have been studied widely in the vision community, overcoming some of these limitations and achieving important results. However, Transformer-based detection methods also have problems, such as poor performance on small targets, slow training, and difficulty converging due to the large number of parameters. To address these problems, this thesis studies and proposes two new Transformer-based scene text detection models.

First, an end-to-end scene text detection model adapted to multi-oriented targets is proposed for the multi-directional scene text detection task. The model adopts multi-scale prediction to address the difficulty of small-target detection: a Transformer encoder is attached to the deep feature maps of a feature pyramid, and local information and global context information are then fused by upsampling and concatenation. A local self-attention mechanism is proposed to speed up training of the Transformer encoder, and on this basis a locally shared positional encoding is proposed to reduce the parameter count and improve generalization. Set prediction is also used: during training, the Transformer decoder outputs a fixed number of prediction boxes, which are matched to the target boxes by the Hungarian algorithm before the loss is computed. The advantage of this formulation is that the number of detection boxes is controlled and no non-maximum suppression is needed at prediction time, which improves detection quality. On the ICDAR2015, ICDAR2017, and MSRA-TD500 datasets, the proposed method achieves strong results in both speed and accuracy.

Second, building on the above model, a Transformer detection network based on instance segmentation is proposed, solving the detection problem by segmentation so as to adapt to the diversity of text shapes. Traditional instance-segmentation-based detection methods usually attend only to local information and ignore context; the model proposed here introduces global context through the Transformer encoder. In the decoding stage, two different methods of upsampling back to the original image size are compared experimentally. The loss function of the network retains only the classification task; after the classification result is obtained, the final detection result is produced by operations such as pixel aggregation. This design makes the network considerably easier to train than a regression task.
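The local self-attention idea described above can be illustrated with a minimal sketch: each position attends only to tokens within a fixed window on either side, so the attention cost drops from O(n²) to O(n·window). This is an illustrative single-head, unprojected version (no learned Q/K/V weights), not the thesis's actual layer; the function name and window scheme are assumptions.

```python
import numpy as np

def softmax(scores):
    # Numerically stable softmax over a 1-D score vector
    e = np.exp(scores - scores.max())
    return e / e.sum()

def local_self_attention(x, window):
    """x: (n, d) token features. Each position i attends only to
    positions within `window` steps of i, instead of all n tokens."""
    n, d = x.shape
    out = np.zeros_like(x)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        scores = x[i] @ x[lo:hi].T / np.sqrt(d)  # scaled dot-product
        out[i] = softmax(scores) @ x[lo:hi]      # weighted sum of neighbours
    return out
```

When the window covers the whole sequence, this reduces to ordinary full self-attention, which is why it can be a drop-in replacement in the encoder.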
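The Hungarian matching step used in training can be sketched as follows. For simplicity this uses only an L1 distance between box coordinates as the matching cost (a stand-in for the thesis's full cost, which in DETR-style training also includes a classification term); the function name and inputs are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_boxes, gt_boxes):
    """One-to-one matching of predicted boxes to ground-truth boxes
    that minimises the total L1 coordinate distance."""
    # cost[i, j] = L1 distance between prediction i and ground truth j
    cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    pred_idx, gt_idx = linear_sum_assignment(cost)  # Hungarian algorithm
    return [(int(i), int(j)) for i, j in zip(pred_idx, gt_idx)]

# Two fixed predictions, two targets; the loss is then computed
# only over the matched pairs, so no NMS is needed at test time.
preds = np.array([[0., 0., 10., 10.], [50., 50., 60., 60.]])
gts   = np.array([[49., 49., 61., 61.], [1., 1., 11., 11.]])
print(match_predictions(preds, gts))  # → [(0, 1), (1, 0)]
```

Because the matching is one-to-one, each ground-truth box trains exactly one prediction slot, and unmatched slots are supervised as background.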
Keywords/Search Tags:Scene Text Detection, Transformer, Convolutional Neural Network, Object Detection, Instance Segmentation