As the main carrier of information, text appears widely in all kinds of visual scenes. The purpose of optical character recognition (OCR) is to obtain the content and location of text instances from images, and it has a wide range of applications, including content review of images uploaded by social-media users, license plate recognition in parking lots and on roads, identification of ID card information, assisted grading of test papers, and photo-based search. Text provides semantic information that other visual cues cannot, and it has the further advantages of being more consistent with human cognition and more interpretable, so accurately recognizing text content in various scenes is of great significance for vision tasks that involve text. Optical character recognition of shopping receipts plays an important role in smart business operation and personal financial management: the extracted key information can be widely used in business strategy formulation, corporate financial systems, personal consumption accounting, and so on. Text detection on receipts presents several difficulties, such as the complex shapes of text instances, extraneous interference outside the receipt region, and the very small spacing between different text lines.

This paper uses salient object detection as a branch task to construct a multi-task model that trains the text detection and salient object detection tasks at the same time, with the two tasks sharing the same feature extraction network. The model integrates deformable convolution and can effectively handle irregular text instances. In addition, this paper proposes a data augmentation method suited to text detection on shopping receipt images in natural scenes, which effectively improves the robustness and accuracy of the model.

Text recognition depends not only on the image features of the receipt images but also on the semantic information extracted from the text instances, because it is difficult to distinguish different characters from image features alone; Chinese characters in particular may differ by only a few pixels. However, excessive reliance on the semantic information of the text can cause the model to overfit. This paper proposes a new text recognition method based on a decoupled transformer, which separates the attention process from the prediction process of the attention mechanism to address the problem of attention drift. In addition, since the transformer has multiple parallel self-attention modules, the model can better learn the semantic relationships among the global text. Compared with a general attention mechanism, the inference speed of this model is 1.9 times higher. A synthetic dataset generation method is also proposed to enable the model to learn more common text combinations and more complex environments, thereby improving its generalization ability.
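To make the multi-task detection idea concrete, the following is a minimal PyTorch sketch of a shared feature extractor with deformable convolution feeding a text-detection head and a saliency head that are trained jointly. The backbone depth, channel sizes, head designs, and loss choices here are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch: shared backbone (with deformable convolution) + two heads.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class DeformBlock(nn.Module):
    """3x3 deformable convolution whose sampling offsets are predicted from the input."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # Two offsets (x, y) per position for each of the 3x3 kernel locations.
        self.offset = nn.Conv2d(in_ch, 2 * 3 * 3, kernel_size=3, padding=1)
        self.conv = DeformConv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(x, self.offset(x))


class MultiTaskTextNet(nn.Module):
    """Shared feature extractor with a text-detection head and a saliency head."""

    def __init__(self, feat_ch: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
            DeformBlock(feat_ch, feat_ch), nn.ReLU(inplace=True),
        )
        # Each head outputs a single-channel probability map.
        self.text_head = nn.Conv2d(feat_ch, 1, kernel_size=1)
        self.saliency_head = nn.Conv2d(feat_ch, 1, kernel_size=1)

    def forward(self, x: torch.Tensor):
        feats = self.backbone(x)
        return torch.sigmoid(self.text_head(feats)), torch.sigmoid(self.saliency_head(feats))


# Joint training step: both task losses back-propagate through the shared backbone.
model = MultiTaskTextNet()
image = torch.randn(2, 3, 256, 256)
text_gt = torch.rand(2, 1, 256, 256)      # text-region ground truth (assumed format)
sal_gt = torch.rand(2, 1, 256, 256)       # saliency ground truth (assumed format)
text_pred, sal_pred = model(image)
loss = nn.functional.binary_cross_entropy(text_pred, text_gt) \
     + nn.functional.binary_cross_entropy(sal_pred, sal_gt)
loss.backward()
```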
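The decoupling of attention from prediction can likewise be illustrated with a small sketch, assuming a design in which attention maps over the visual feature sequence are computed without consuming previously decoded characters, and a separate classifier turns each attended feature into a character. This is not the paper's exact decoupled transformer; names such as `max_len` and `vocab_size` are placeholders.

```python
# Illustrative sketch of decoupled attention for text recognition.
import torch
import torch.nn as nn


class DecoupledAttentionHead(nn.Module):
    def __init__(self, feat_dim: int = 256, max_len: int = 32, vocab_size: int = 7000):
        super().__init__()
        # Attention module: one learned query per output position, scored against
        # the visual features alone (no dependence on earlier character predictions).
        self.pos_queries = nn.Parameter(torch.randn(max_len, feat_dim))
        self.key_proj = nn.Linear(feat_dim, feat_dim)
        # Prediction module: classifies each attended feature independently.
        self.classifier = nn.Linear(feat_dim, vocab_size)

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, seq_len, feat_dim), e.g. a flattened CNN feature map.
        keys = self.key_proj(visual_feats)                         # (B, S, D)
        scores = torch.einsum("ld,bsd->bls", self.pos_queries, keys)
        attn = scores.softmax(dim=-1)                              # (B, L, S)
        glimpses = torch.bmm(attn, visual_feats)                   # (B, L, D)
        return self.classifier(glimpses)                           # (B, L, V)


# All output positions are predicted in parallel; this parallelism is the kind
# of property that allows faster inference than step-by-step attention decoding.
head = DecoupledAttentionHead()
feats = torch.randn(4, 100, 256)   # 4 images, 100 visual feature positions
logits = head(feats)               # (4, 32, 7000)
```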