Research On OCR Technology Based On Segmentation And Encoder-decoder Architecture

Posted on: 2022-09-28
Degree: Master
Type: Thesis
Country: China
Candidate: K Y Xie
GTID: 2518306572460044
Subject: Computer technology
Abstract/Summary:
Optical character recognition (OCR), an important technology for office automation, has a wide range of application scenarios. With the explosive growth in recent years of training data that supports supervised learning for OCR, research on text detection and recognition has gradually become a popular direction with broad prospects in deep learning. This thesis studies OCR technology based on deep learning, focusing on how to maintain text detection and recognition performance while reducing the model's inference time as much as possible, so that scene text detection and recognition can be completed quickly and efficiently.

For the text detection task, the proposed model first uses a series of convolutional neural network modules as the backbone network to extract feature maps from the original image, and a feature pyramid structure is added to the backbone for multi-scale prediction. Two branches then classify and segment the object, respectively, to produce the final detection result. In the training stage, the classification branch uses focal loss and the segmentation branch uses binary cross-entropy loss; the loss values of both branches are combined to supervise the entire text detection model. Because text detection only requires binary classification, a single-shot object detection method achieves good detection results while greatly accelerating the detection process.

For the text recognition task, the proposed model first uses a set of convolutional neural networks as the encoder to extract image features, and then uses a bidirectional Transformer structure to decode the extracted feature maps. The bidirectional Transformer structure is formed by stacking multi-head attention modules and feed-forward networks. This structure extracts bidirectional features more effectively and supports parallel computation. Finally, the output probability distribution and the dataset labels are fed into the CTC loss function to supervise the training of the entire model. In the inference stage, the probability distribution is decoded directly to obtain the final predicted text.

In this thesis, the segmentation-based text detection model and the encoder-decoder text recognition model are trained separately on standard datasets, and both models converge on their respective datasets. The text detection model achieves an F1 score of 85.79 on the evaluation datasets at 9 FPS. Its detection accuracy is comparable to that of methods in related work, while its FPS is more than 3 times higher. The text recognition model achieves an accuracy of 85.6% on the evaluation datasets. Its inference time per batch is 12.82 ms, which is only 2.8% to 36.6% of that of similar methods under the same conditions, and its parameter count is 2.5% to 17.5% lower than that of similar models. The proposed OCR model is highly practical and is well suited to scene text OCR applications that require low latency and fast response.
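The combined detection loss described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the two loss values are combined, so the weighted sum and the `seg_weight` parameter below are assumptions, as are the function names. The `gamma=2.0` and `alpha=0.25` defaults are the values commonly used with focal loss, not values reported by the thesis.

```python
import math


def focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss over predicted foreground probabilities."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)          # avoid log(0)
        p_t = p if y == 1 else 1.0 - p           # probability of the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha
        # (1 - p_t) ** gamma down-weights examples that are already easy.
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy, e.g. over the pixels of a predicted mask."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def detection_loss(cls_probs, cls_targets, mask_probs, mask_targets,
                   seg_weight=1.0):
    """Assumed combination: focal loss plus weighted segmentation BCE."""
    return (focal_loss(cls_probs, cls_targets)
            + seg_weight * bce_loss(mask_probs, mask_targets))


# Toy inputs: two classification scores and four mask pixels.
loss = detection_loss([0.9, 0.2], [1, 0], [0.8, 0.7, 0.1, 0.3], [1, 1, 0, 0])
```

With `gamma=0.0` and `alpha=0.5` the focal term reduces to half the ordinary cross-entropy, which is a convenient sanity check on the implementation.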
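The inference step, in which the output probability distribution is decoded directly, can be sketched as greedy (best-path) CTC decoding. The abstract does not name the decoding rule, so greedy decoding is an assumption here, as are the function name `ctc_greedy_decode`, the convention that class index 0 is the CTC blank, and the two-character toy alphabet.

```python
def ctc_greedy_decode(prob_seq, alphabet, blank=0):
    """Greedy CTC decoding of a per-frame probability distribution.

    prob_seq: list of per-frame probability lists; index `blank` is the
              CTC blank and index i > 0 maps to alphabet[i - 1]
              (an assumed convention that requires blank == 0).
    """
    # 1. Pick the most likely class in every frame.
    best_path = [max(range(len(frame)), key=frame.__getitem__)
                 for frame in prob_seq]

    # 2. Collapse consecutive repeats, then drop blanks.
    chars = []
    previous = None
    for index in best_path:
        if index != previous and index != blank:
            chars.append(alphabet[index - 1])
        previous = index
    return "".join(chars)


# Toy example: 7 frames over the classes [blank, 'a', 'b'].
frames = [
    [0.10, 0.80, 0.10],  # a
    [0.20, 0.70, 0.10],  # a (repeat, collapsed)
    [0.90, 0.05, 0.05],  # blank separates the two 'a' characters
    [0.10, 0.60, 0.30],  # a
    [0.10, 0.20, 0.70],  # b
    [0.20, 0.10, 0.70],  # b (repeat, collapsed)
    [0.80, 0.10, 0.10],  # blank
]
decoded = ctc_greedy_decode(frames, alphabet="ab")
```

Because every frame is decoded independently with a single argmax, this step needs no autoregressive loop, which is consistent with the low per-batch inference time the abstract emphasizes.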
Keywords/Search Tags: Deep learning, OCR, Segmentation, Encoder-decoder, Supervised learning