
Research On Table Recognition Technique Based On PDF Text Stream

Posted on: 2011-06-23
Degree: Master
Type: Thesis
Country: China
Candidate: B Zhang
Full Text: PDF
GTID: 2178360305954046
Subject: Computer Science and Technology
Abstract/Summary:
PDF (Portable Document Format) is an internationally accepted open standard for electronic documents. Because the file format is independent of the operating system platform, it is well suited to publishing electronic documents and disseminating digital information over the Internet. More and more e-books, product descriptions, company documents, web pages, and e-mail messages use PDF, and government agencies, enterprises, and institutions use the format extensively as a standard for information dissemination, exchange, and storage.
Tables are an important part of PDF documents and are reused and re-edited very frequently. However, the way PDF represents tables makes some common table operations difficult to carry out. A table in PDF is purely visual: the format has no table structure, only groups of words and some drawn lines. A reader sees the table in the rendered display but cannot obtain the table's structure directly from the document. We call such a table a "text stream" table, and its recognition "text stream-based table recognition." Traditional image-based table recognition is relatively mature, but because the medium that carries the table is so different, those techniques are difficult to apply to text stream-based table recognition in PDF.
On this basis, the dissertation studies PDF text stream-based table recognition and designs and implements a table recognition system. The system recognizes and reproduces a table as follows. First, it parses the PDF document and separates the text, images, and other original information from the PDF content stream. Second, it builds and maintains a text stream data structure that holds the text object information, and renders the PDF content on screen. Third, working from the on-screen output, the user locates the table to be reproduced. Fourth, the system rasterizes the table content: it divides the space occupied by the text nodes horizontally and vertically and saves the result as a conceptual frame (grid) structure for the table. Fifth, the system places the content according to the rasterization result: using the conceptual grid, it finds the table cell in which each text stream node lies and establishes the relative relationships among the text stream nodes, which yields the physical structure of the table. Finally, the system serializes this structure into a one-dimensional sequence and saves it in a common structured encoding format such as HTML, so that the result can be browsed as a web table or exported to office automation (OA) software for visual editing.
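The rasterization, placement, and HTML output steps can be illustrated with a minimal Python sketch. It is not the dissertation's implementation. It assumes the text nodes have already been extracted from the content stream with bounding boxes in a top-down coordinate system, and that whitespace gaps alone separate rows and columns. The names TextNode, split_bands, band_index, build_grid, and to_html, and the gap threshold, are hypothetical.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class TextNode:
    """A hypothetical text stream node: a string plus its bounding box."""
    text: str
    x0: float  # left edge
    x1: float  # right edge
    y0: float  # top edge (assumes y grows downward)
    y1: float  # bottom edge


def split_bands(intervals, gap=1.0):
    """Merge 1-D intervals into bands; whitespace wider than `gap` starts a new band."""
    bands = []
    for lo, hi in sorted(intervals):
        if bands and lo <= bands[-1][1] + gap:
            bands[-1][1] = max(bands[-1][1], hi)
        else:
            bands.append([lo, hi])
    return bands


def band_index(bands, lo, hi):
    """Return the index of the band that contains the midpoint of [lo, hi]."""
    mid = (lo + hi) / 2
    for i, (b_lo, b_hi) in enumerate(bands):
        if b_lo <= mid <= b_hi:
            return i
    raise ValueError("interval does not fall in any band")


def build_grid(nodes, gap=1.0):
    """Rasterize: derive column and row bands, then place each node in its cell."""
    cols = split_bands([(n.x0, n.x1) for n in nodes], gap)
    rows = split_bands([(n.y0, n.y1) for n in nodes], gap)
    grid = [["" for _ in cols] for _ in rows]
    # Visit nodes in reading order so text sharing a cell is joined left to right.
    for n in sorted(nodes, key=lambda n: (n.y0, n.x0)):
        r = band_index(rows, n.y0, n.y1)
        c = band_index(cols, n.x0, n.x1)
        grid[r][c] = (grid[r][c] + " " + n.text).strip()
    return grid


def to_html(grid):
    """Serialize the grid row by row into a one-dimensional HTML string."""
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in row) + "</tr>"
        for row in grid
    )
    return f"<table>{body}</table>"


nodes = [
    TextNode("Name", 0, 30, 0, 10),
    TextNode("Age", 50, 70, 0, 10),
    TextNode("Ann", 0, 25, 20, 30),
    TextNode("31", 50, 60, 20, 30),
]
grid = build_grid(nodes)
html_table = to_html(grid)
```

Projecting the boxes onto each axis and merging overlapping intervals is only one simple way to obtain the horizontal and vertical divisions. It does not handle merged cells or text that spans several columns, which a full system would need to address.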
Keywords/Search Tags:PDF, text stream, table recognition