
Research On Layout Analysis And Text Line Extraction Of Document Image

Posted on: 2020-09-03
Degree: Master
Type: Thesis
Country: China
Candidate: Q Zhang
Full Text: PDF
GTID: 2428330590494383
Subject: Computer Science and Technology
Abstract/Summary:
Document digitization has broad applications. With Optical Character Recognition (OCR), the required data can be extracted directly from an image, which greatly facilitates the storage, processing and retrieval of information and reduces the burden of manual input. Accurate text box extraction is an important prerequisite for successful text recognition. A number of deep learning models, such as CNN+LSTM+CTC, have been proposed and effectively solve the problem of end-to-end character recognition, but the performance of text row extraction is still far from satisfactory. This dissertation therefore focuses on extracting text rows from original images more effectively and accurately.

Because of issues such as skew and complex backgrounds, a document image usually contains a lot of noise or invalid information, which greatly affects the final recognition performance. To address these issues, we first introduce preprocessing methods for skew correction and de-noising. Then, to detect text objects in a document image accurately, this dissertation presents a deep learning method for object detection and semantic segmentation. This method addresses two weaknesses of traditional methods: difficulty in extracting page features and poor generality. The general algorithm is refined and multi-scale feature fusion is used. To verify the performance of the proposed method, mAP at IoU thresholds of 0.6 and 0.8 is measured on the ICDAR 2017 page object detection dataset; the results improve from 0.787 and 0.637 to 0.865 and 0.752 respectively.

Some preprocessing methods optimized for specific document objects may not be suitable for other objects, so, in order to reduce the loss of information, each area of the page should be processed according to the type of page object it contains. For example, offline processing is applied in the table area, and seals are removed in the seal area by separating color channels. Because the distribution of text differs between pure text pages and table pages, different text box extraction algorithms are designed for each in this thesis. For pure text pages, text box extraction combines the deep-learning-based CTPN algorithm with the projection method, which effectively solves the problem of text box extraction against a complex page background. Designing the extraction algorithm around the features of each page type yields better text extraction overall. By combining the text detection algorithms with an OCR engine, a complete document recognition system is implemented. Experiments are conducted on a corpus constructed for a real application, and the results show that the system achieves good results in image de-noising, page object detection and text box extraction, and that the whole system reaches satisfactory performance for real application.
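The abstract mentions a projection method but gives no details. As a rough illustration only, a generic horizontal projection profile can be sketched as follows. It assumes a binarized page stored as a list of rows in which 1 marks an ink pixel. The function name `extract_text_rows` and the parameters `min_ink` and `min_height` are hypothetical and are not taken from the thesis, and the sketch does not include the CTPN stage.

```python
from typing import List, Tuple


def extract_text_rows(
    binary: List[List[int]],
    min_ink: int = 1,
    min_height: int = 1,
) -> List[Tuple[int, int]]:
    """Return inclusive (top, bottom) row ranges that contain ink.

    binary: page image as rows of 0/1 values, where 1 is an ink pixel.
    min_ink: minimum ink pixels for a row to count as text (assumed threshold).
    min_height: minimum number of rows for a band to be kept (assumed threshold).
    """
    # Horizontal projection: number of ink pixels in each image row.
    profile = [sum(row) for row in binary]

    rows: List[Tuple[int, int]] = []
    start = None  # top of the band currently being tracked, if any
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y  # a new text band begins
        elif ink < min_ink and start is not None:
            if y - start >= min_height:
                rows.append((start, y - 1))  # band ended on the previous row
            start = None
    # Close a band that runs to the bottom edge of the page.
    if start is not None and len(profile) - start >= min_height:
        rows.append((start, len(profile) - 1))
    return rows


page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
print(extract_text_rows(page))  # [(1, 2), (5, 5)]
```

A plain profile like this only separates rows cleanly on deskewed pages with a quiet background, which is consistent with the abstract pairing the projection method with skew correction, de-noising and a learned detector.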
Keywords/Search Tags: page object detection, image preprocessing, complex layout analysis, text detection