Design And Implementation Of PDF Format Based Table Extraction Method

Posted on:2016-08-11

Degree:Master

Type:Thesis

Country:China

Candidate:H J Tang

Full Text:PDF

GTID:2298330467991849

Subject:Computer Science and Technology

Abstract/Summary:

PDF (Portable Document Format) is a portable, cross-platform file format developed by Adobe. Because PDF files can be used on Windows, Unix, Mac OS and other mainstream operating systems, PDF has become an ideal format for publishing electronic documents on the internet and for transmitting digital information. More and more electronic books, product specifications, company earnings announcements, online information, scientific literature, e-mails and similar documents are published in PDF as their first choice of format.

As PDF has grown in popularity, a large amount of valuable information is now presented in PDF documents, and extracting that information has become a research hotspot in recent years. However, because of the complex structure of PDF, it is not easy to extract text, graphics and tables from PDF files, and table extraction is especially difficult. Unlike HTML, PDF has no definition of a table: a table in a PDF file is just a collection of words and lines, which makes PDF table extraction a major challenge. Traditional methods for table recognition and extraction rely heavily on the tag information of HTML tables and are therefore not suitable for PDF. To solve this problem, this thesis presents a general method for PDF table recognition and extraction. To verify the validity and accuracy of the method, the thesis then applies it to the extraction of financial table data; the results show that the method performs well.

The thesis first describes the research background, introduces the main characteristics of PDF, and introduces the PDFBox library used by the system. Second, it compares several common table extraction methods and, by analyzing the pros and cons of each, arrives at the method adopted in the thesis. It then describes the PDF table extraction method in detail, including basic box-line identification, table reconstruction, first-row and first-column processing, cross-page table merging, and tabular data formatting. Finally, the thesis tests and evaluates the performance of the method by implementing the recognition and extraction of tabular data from the three financial statements.
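The abstract does not give the algorithm itself, so the following is only a minimal sketch of what box-line based table reconstruction might look like, not the thesis's actual method. It assumes the ruling lines and word positions have already been read from the PDF (for example with a library such as PDFBox), that coordinates use a top-left origin with y increasing downward, and that the table is a simple grid without merged cells. The names `Word`, `cluster` and `build_table` are hypothetical and chosen for illustration.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    x: float  # horizontal centre of the word
    y: float  # vertical centre of the word (assumed to grow downward)


def cluster(values, tol=1.0):
    """Collapse coordinates lying within `tol` of the last kept one into a single grid line."""
    merged = []
    for v in sorted(values):
        if not merged or v - merged[-1] > tol:
            merged.append(v)
    return merged


def build_table(h_lines, v_lines, words, tol=1.0):
    """Rebuild a rows-by-columns grid of cell text.

    h_lines: y-coordinates of horizontal ruling lines
    v_lines: x-coordinates of vertical ruling lines
    words:   positioned words found on the page
    """
    ys = cluster(h_lines, tol)
    xs = cluster(v_lines, tol)
    if len(ys) < 2 or len(xs) < 2:
        return []  # fewer than two lines in one direction cannot enclose a cell
    rows, cols = len(ys) - 1, len(xs) - 1
    cells = [[[] for _ in range(cols)] for _ in range(rows)]
    # Visit words in reading order so multi-word cells keep their word order.
    for w in sorted(words, key=lambda w: (w.y, w.x)):
        r = bisect_right(ys, w.y) - 1
        c = bisect_right(xs, w.x) - 1
        if 0 <= r < rows and 0 <= c < cols:  # ignore words outside the grid
            cells[r][c].append(w.text)
    return [[" ".join(cell) for cell in row] for row in cells]


# Made-up example data: a 2 x 2 table with one duplicated horizontal line.
table = build_table(
    h_lines=[0, 0.4, 20, 40],
    v_lines=[0, 50, 100],
    words=[
        Word("Item", 25, 10),
        Word("Amount", 75, 10),
        Word("Net", 15, 30),
        Word("income", 35, 30),
        Word("1,200", 75, 30),
    ],
)
```

Clustering nearly coincident coordinates is a design choice for this sketch: PDF generators often draw one visual border as several overlapping segments, so treating every segment as a separate grid line would create spurious empty rows or columns. Cross-page merging and first-row or first-column handling, which the abstract also mentions, are not covered here.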

Keywords/Search Tags:

structure of PDF, table recognition, table rasterization, box-line identification

Related items

1	Research And Implementation On Table Detection And Table Structure Recognition Method Based On Deep Learning
2	Research And Implementation Of The Web Page Table Structure Recognition
3	Table Recognition Algorithm In Document Images Based On Deep Learning With Its Implementation
4	Identification And Application Of Hand Filled Paper Form
5	The Research And Implementation Of Table Recognition System Based On Deep Learning
6	Table Recognition Based On Digital Image Processing
7	Design And Implementation Of A Table Recognition System Based On Deep Learning
8	Research On Table Structure Recognition Based On Visual And Text Features
9	Table Content Extraction Based On Image Processing And Deep Learning
10	Research On The Recognition System Of Table Tennis And Batter Based On Computer Vision