Research On Document Image Layout Analysis And Text Extraction

Posted on: 2019-09-26  Degree: Master  Type: Thesis
Country: China  Candidate: J F Zhu  Full Text: PDF
GTID: 2428330545974347  Subject: Information and Communication Engineering
Abstract/Summary:
The digitization of paper documents based on image processing is an important field of pattern recognition. Using image processing, optical character recognition and related technologies, a paper document can be reconstructed through layout analysis and understanding and converted into digital resources such as double-layered PDF or Word documents. Over the Internet, the public can then access archives more conveniently. Layout analysis and text extraction are essential basic steps in the digital reconstruction of documents. In an era that values individuality and creativity, the typographic structure of document layouts is becoming increasingly complex: pictures, tables, and even printed and handwritten characters are mixed together. Such typography poses many challenges for layout analysis and text extraction. Addressing the two major problems of document image layout analysis and text extraction, the main work of this thesis covers the following two aspects:

(1) Document layout object detection based on deep transfer learning. Because formulas, tables and illustrations appear in document image layouts in disordered and diverse ways, traditional layout analysis methods often require different processing strategies for different document images. Moreover, because their recognition and localization modules are completely independent, these methods suffer from system redundancy and severely limited versatility. To realize a generalized layout analysis system and overcome the shortage of sample data for document images, a document layout object detection method based on deep transfer learning is proposed. Because ample data is available for deep learning research on the semantic understanding of natural scenes, we adopt transfer learning to migrate an object detection model for natural scenes to document layout targets such as formulas, figures and tables. Multiple objects of the document layout can be detected within a single network framework, which improves the versatility of the system. The experimental results show that the algorithm achieves high object detection accuracy and improves processing efficiency.

(2) Research on a handwritten text extraction algorithm under unconstrained writing conditions. Because of the unconstrained paper layout and free writing style, text lines may be tilted, curved, crossing or touching. Traditional text line segmentation or clustering methods cannot guarantee the classification accuracy of the pixels between text lines. In this thesis, a joint regression-clustering framework for handwritten text line extraction is proposed. The main body area of each text line is first extracted by smearing, and the text line regression model is then obtained by extracting the skeleton structure of the main body area. For the classification and clustering of connected components, an approach based on associative hierarchical random fields is presented. A higher-order energy model is established by constructing a hierarchical network of pixels, connected components and text lines. From this model, an energy function is built whose minimization yields the text line labels of the connected components. Finally, sticky (touching) characters that share the same label are detected, and their pixels are re-clustered with the k-means algorithm under the constraint of the text line regression model. With the instance labels of the text lines, text lines can be manipulated by switching labels, so geometric segmentation of the document image is no longer needed. Experiments show that the proposed framework improves segmentation accuracy at the pixel level and makes the edges of text lines more controllable than traditional algorithms such as piecewise projection, MST-based clustering and seam carving. The proposed system achieves high performance on text line extraction, with better robustness and accuracy, while excluding the interference of adjacent text lines to the greatest extent.
Keywords/Search Tags: document image processing, deep learning, object detection, layout analysis, text line extraction
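
The smearing step mentioned in part (2) of the abstract can be illustrated with a minimal sketch. The abstract does not say which smearing variant or parameters the thesis uses, so the Python below assumes a plain horizontal run-length smearing on a binary image (1 = ink, 0 = background). The names `smear_row`, `smear_horizontal` and `max_gap` are hypothetical and chosen only for this illustration.

```python
from typing import List


def smear_row(row: List[int], max_gap: int) -> List[int]:
    """Fill background runs of length <= max_gap that lie between two ink pixels.

    Assumption: this is generic run-length smearing, not necessarily the
    variant used in the thesis.
    """
    out = list(row)
    last_ink = None  # index of the most recent ink pixel seen so far
    for x, value in enumerate(row):
        if value:
            if last_ink is not None:
                gap = x - last_ink - 1  # background pixels between the two ink pixels
                if 0 < gap <= max_gap:
                    for k in range(last_ink + 1, x):
                        out[k] = 1
            last_ink = x
    return out


def smear_horizontal(image: List[List[int]], max_gap: int) -> List[List[int]]:
    """Apply the row-wise smearing to every row of a binary image."""
    return [smear_row(row, max_gap) for row in image]


if __name__ == "__main__":
    row = [0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
    # The 2-pixel gap is bridged; the 4-pixel gap and the borders are left alone.
    print(smear_row(row, max_gap=2))  # [0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
```

With a suitably chosen `max_gap`, characters and words on the same line merge into elongated blobs, which is the kind of main body area from which a skeleton could then be extracted. Choosing the threshold is data dependent, and the abstract gives no value for it.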