
Design And Implementation Of PDF Format Based Table Extraction Method

Posted on: 2016-08-11
Degree: Master
Type: Thesis
Country: China
Candidate: H J Tang
Full Text: PDF
GTID: 2298330467991849
Subject: Computer Science and Technology
Abstract/Summary:
PDF (Portable Document Format) is a portable, cross-platform file format developed by Adobe. Because it is cross-platform, PDF is widely used on Windows, Unix, Mac OS, and other mainstream operating systems, and it has become an ideal format for publishing electronic documents and transmitting digital information on the internet. A growing number of e-books, product specifications, company earnings announcements, online information, scientific literature, e-mails, and similar materials use PDF as their first-choice electronic document format.
With the popularity of PDF, a large amount of valuable information is now presented in PDF documents, so extracting that information has become a research hotspot in recent years. However, because of the complex structure of PDF, extracting text, graphics, and tables from PDF files is not easy, and table extraction is especially difficult. Unlike HTML, PDF has no definition of a table: a table in a PDF file is only a collection of words and lines, which makes table extraction a major challenge. Traditional table recognition and extraction methods rely heavily on HTML table tags and are therefore unsuitable for PDF. To solve this problem, this paper presents a general method for recognizing and extracting tables from PDF files. To verify its validity and accuracy, the method is applied to the extraction of financial table data, and the results show that it performs well.
The paper first describes the research background, introduces the main characteristics of PDF, and introduces the PDFBox library used by the system. Second, it compares several common table extraction methods and, by analyzing the pros and cons of each, arrives at the method adopted in the paper. It then describes the PDF table extraction method in detail, including basic box line identification, table reduction, first row and first column processing, cross-page table merging, and tabular data formatting. Finally, the paper tests and evaluates the method’s performance by implementing data recognition and extraction for the three financial statements.
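The observation that a PDF table is only a collection of words and lines can be illustrated with a minimal Python sketch. It is not the thesis's algorithm and does not use PDFBox. It assumes the ruling-line positions and word positions have already been extracted from the page, that y grows downward from the top of the page, and that each word is described by its center point. The names `cluster`, `build_grid`, and the tolerance `tol` are hypothetical.

```python
from bisect import bisect_right


def cluster(values, tol=2.0):
    """Collapse nearly equal coordinates (e.g. double-drawn rules) into one each."""
    merged = []
    for v in sorted(values):
        # Keep v only if it is farther than tol from the last kept coordinate.
        if not merged or v - merged[-1] > tol:
            merged.append(v)
    return merged


def build_grid(h_lines, v_lines, words, tol=2.0):
    """Rebuild table cells from ruling lines and positioned words.

    h_lines: y coordinates of horizontal rules (assumed: y grows downward)
    v_lines: x coordinates of vertical rules
    words:   (x, y, text) tuples, with (x, y) the assumed center of each word
    Returns a list of rows, each a list of cell strings.
    """
    ys = cluster(h_lines, tol)
    xs = cluster(v_lines, tol)
    n_rows, n_cols = len(ys) - 1, len(xs) - 1
    if n_rows < 1 or n_cols < 1:
        return []  # Fewer than two rules in one direction: no closed cells.

    cells = [[[] for _ in range(n_cols)] for _ in range(n_rows)]
    # Visit words top to bottom, then left to right, to keep reading order.
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        row = bisect_right(ys, y) - 1
        col = bisect_right(xs, x) - 1
        # Words outside the outer box are not part of the table.
        if 0 <= row < n_rows and 0 <= col < n_cols:
            cells[row][col].append(text)
    return [[" ".join(cell) for cell in row] for row in cells]


if __name__ == "__main__":
    table = build_grid(
        h_lines=[0, 0.5, 20, 40],
        v_lines=[0, 50, 100],
        words=[(10, 10, "Item"), (60, 10, "Amount"),
               (10, 30, "Net"), (25, 30, "profit"), (60, 30, "1,200")],
    )
    for row in table:
        print(row)
```

The sketch covers only fully ruled tables. Steps named in the abstract, such as first row and first column processing and cross-page table merging, would need additional logic that is not shown here.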
Keywords/Search Tags: structure of PDF, table recognition, table rasterization, box line identification