Layout Analysis And Table Extraction In Unstructured Documents

Posted on: 2020-06-23
Degree: Master
Type: Thesis
Country: China
Candidate: H Y Zhang
Full Text: PDF
GTID: 2428330575495165
Subject: Pattern recognition
Abstract/Summary:
A large number of documents exist only as images. Structuring these unstructured documents is the first step, and the key technique, in analyzing them automatically. For this purpose, this thesis combines an improved region-based convolutional neural network (R-CNN), namely Faster R-CNN, with an image-processing algorithm based on projection calculation. With the proposed method, the layout components of unstructured documents are automatically classified and located, and the forms (tables) in the documents are identified, extracted and converted, so that the unstructured documents become structured. The thesis covers two topics: layout analysis of unstructured documents and form analysis in unstructured documents. The work is detailed as follows.

For layout analysis, the unstructured documents are first converted to images. The horizontal and vertical projections of the images are calculated to classify the different layout components. Both image-processing algorithms and pattern recognition methods are used to identify and locate each layout component in the image. Where these methods cannot reliably identify the layout components, Faster R-CNN is used instead. This hybrid method reduces the computational power and the number of training documents that Faster R-CNN would otherwise require, while still classifying the layout structures and locating the forms in the unstructured documents accurately.

For form recognition, image-processing algorithms are proposed to address the influence of image noise, form tilt and form occlusion. According to their style, all forms extracted from unstructured documents are assigned to one of four groups: fully delineated forms, horizontally delineated forms, color-encoded forms and forms without lines. A separate algorithm is designed for each category to recognize the form structure accurately. Finally, each cell in the identified form is split out and its characters are recognized, so that the form can be reproduced in Excel format.

Mean average precision (mAP) is used to measure the classification and localization performance of the Faster R-CNN network. The form recognition algorithm is evaluated by its identification rate and its conversion rate (the ratio of the number of forms identified and converted to the total number of experimental samples). The proposed method achieves an mAP of 71.3% and a recognition conversion rate of 81%.

The thesis thus classifies and locates the layout components of common unstructured documents, and identifies, extracts and reproduces the forms they contain. The results indicate that the proposed method can identify and locate the text, pictures and forms in unstructured documents, and can reproduce the forms as electronic forms in Excel format. This study may facilitate further work on making use of the unstructured documents that exist widely in practice.
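The projection step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes the page has already been binarized into a grid where 1 marks an ink pixel and 0 marks background, and the function names (`horizontal_projection`, `vertical_projection`, `runs_above`, `segment_blocks`) are hypothetical. Rows are first grouped into bands wherever the horizontal projection is non-zero, and each band is then split into blocks using its vertical projection.

```python
from typing import List, Tuple

# Assumption: a binarized page, 1 = ink (dark pixel), 0 = background.
Image = List[List[int]]


def horizontal_projection(img: Image) -> List[int]:
    """Count ink pixels in every row."""
    return [sum(row) for row in img]


def vertical_projection(img: Image) -> List[int]:
    """Count ink pixels in every column."""
    return [sum(col) for col in zip(*img)]


def runs_above(profile: List[int], threshold: int = 0) -> List[Tuple[int, int]]:
    """Return half-open (start, end) ranges where the profile exceeds threshold."""
    runs = []
    start = None
    for i, value in enumerate(profile):
        if value > threshold and start is None:
            start = i                      # a run of ink begins
        elif value <= threshold and start is not None:
            runs.append((start, i))        # the run ended at a blank gap
            start = None
    if start is not None:                  # run reaches the end of the profile
        runs.append((start, len(profile)))
    return runs


def segment_blocks(img: Image) -> List[Tuple[int, int, int, int]]:
    """Split a page into (top, bottom, left, right) blocks using projections."""
    blocks = []
    for top, bottom in runs_above(horizontal_projection(img)):
        band = img[top:bottom]
        for left, right in runs_above(vertical_projection(band)):
            blocks.append((top, bottom, left, right))
    return blocks


# Toy page: two side-by-side blocks on top, one wide block below.
page = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
print(segment_blocks(page))
# [(1, 3, 0, 2), (1, 3, 4, 6), (4, 5, 1, 5)]
```

On real scans the threshold would normally be raised above zero to tolerate noise, and the resulting blocks would still need to be classified as text, picture or form, which is where the abstract's pattern recognition methods and Faster R-CNN come in.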
Keywords/Search Tags: Unstructured documents, Layout analysis, Form extraction, Form reconstruction