
The PDF Document Generation And Its Content Extraction In ScienceWord

Posted on: 2008-02-11; Degree: Master; Type: Thesis
Country: China; Candidate: P Liu; Full Text: PDF
GTID: 2178360272470054; Subject: Computer application technology
Abstract/Summary:
PDF, one of the best choices for describing complex layouts, has become a de facto industry standard and is frequently used to store scientific papers. ScienceWord, a professional word processor for scientific documents, makes it easy to edit papers that contain scientific document objects such as formulas, chemical equations and geometric shapes. However, the ScienceWord document format is not yet widely used, so a PDF storage module needs to be added to ScienceWord to meet users' needs.

The PDF storage module is independent of the ScienceWord development framework. Functionally, it contains two types of components: components for basic document elements, such as text, figures and images, and components for scientific document objects. In the former, basic document elements are described using the page description language of the PDF file. All the factors that affect the generated PDF files, such as layout display accuracy, storage efficiency and platform independence, must be taken into account. The latter components are harder to implement because scientific document objects carry complex logical information, which cannot be described in an ordinary PDF file. To reduce information loss during storage, the logical information is abstracted into structured expressions and then translated into logical trees in the PDF by defining new PDF tags.

The prototype system, implemented on the basis of these results, can serve as a PDF storage solution because it is compatible with the ScienceWord framework. Tests show that all structured logical information is expressed exactly in the tagged PDF files generated by the prototype system and can be extracted faithfully through Acrobat plug-ins. Because the entire page content is well organized, these tagged PDF files can be expected to support more advanced features. They will also be helpful for the future development of PDF file retrieval and other applications.
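
The idea of translating structured expressions into logical trees with new tags can be illustrated with a minimal Python sketch. It is not the thesis implementation: the tag names (SWFormula, SWFraction and so on), the LogicalNode class and the helper functions are hypothetical, and the tree is modelled as plain Python dictionaries rather than written into a real PDF file. The entry names S, K, ActualText and RoleMap follow the PDF logical structure conventions, where a role map associates non-standard tags with standard structure types.

```python
from dataclasses import dataclass, field


@dataclass
class LogicalNode:
    """One node of a structured expression, e.g. a fraction or a symbol."""
    tag: str                                      # hypothetical custom tag name
    text: str = ""                                # leaf content, empty for inner nodes
    children: list = field(default_factory=list)  # child LogicalNode objects


# Assumption: illustrative custom tags mapped to standard PDF structure types.
ROLE_MAP = {
    "SWFormula": "Formula",
    "SWFraction": "Span",
    "SWNumerator": "Span",
    "SWDenominator": "Span",
}


def to_struct_elem(node):
    """Convert a logical node into a dict shaped like a PDF structure element."""
    if node.tag not in ROLE_MAP:
        raise ValueError(f"no role mapping for tag {node.tag!r}")
    elem = {"Type": "StructElem", "S": node.tag}
    if node.text:
        elem["ActualText"] = node.text
    if node.children:
        elem["K"] = [to_struct_elem(child) for child in node.children]
    return elem


def build_struct_tree(root):
    """Wrap the converted elements in a dict shaped like a structure tree root."""
    return {
        "Type": "StructTreeRoot",
        "RoleMap": dict(ROLE_MAP),
        "K": [to_struct_elem(root)],
    }


def extract_text(elem):
    """Walk the tree depth-first and collect leaf text, as an extractor might."""
    own = elem.get("ActualText", "")
    return own + "".join(extract_text(kid) for kid in elem.get("K", []))


# Example: the fraction a/b as a structured expression.
formula = LogicalNode("SWFormula", children=[
    LogicalNode("SWFraction", children=[
        LogicalNode("SWNumerator", text="a"),
        LogicalNode("SWDenominator", text="b"),
    ]),
])
tree = build_struct_tree(formula)
```

In this sketch the nesting of the fraction survives in the K entries, so a reader of the tree can tell numerator from denominator by tag, while a viewer that does not know the custom tags can fall back on the standard types in the role map.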
Keywords/Search Tags: Tagged PDF, Science Document Object, Document Storage