
Research For Semantic Information Extraction From PDF Document

Posted on: 2005-01-01
Degree: Master
Type: Thesis
Country: China
Candidate: B Zhang
Full Text: PDF
GTID: 2168360125454755
Subject: Computer application technology
Abstract/Summary:
PDF documents are widely used, and a growing number of people and institutions are adopting the format as its applications continue to develop. The ubiquity and rapid growth of PDF stand in striking contrast to how inefficiently PDF documents are managed, so semantic-based query and management for PDF is needed now. This system combines information extraction with machine learning. Valuable data is extracted from a PDF document according to its semantics and then wrapped into XML. The system has two principal processes. The first is forming extraction rules. The user first reads the sample document in a PDF viewer, then creates a semantic schema for it and establishes the mapping between the semantic items of the schema and the data items in the PDF. While the user is doing this, the sample PDF is converted into well-formed XML. Once both steps are complete, the system automatically produces the rules from the well-formed XML according to the mapping. The second is information extraction using the rules, followed by information wrapping. The user submits the PDF documents and the domain information. The system preprocesses the PDF documents into well-formed XML documents, retrieves the extraction rules that match the domain information, and applies those rules to the well-formed XML documents, yielding self-describing, semi-structured XML. The system is significant for semantic-based query and management of PDF.
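
The two processes described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract does not specify the rule format, so the sketch assumes a rule is simply the label text that precedes a value on a line. The `<document>`/`<line>` layout of the well-formed XML, the domain name `invoice`, and the names `learn_rules`, `extract`, and `RULES_BY_DOMAIN` are all hypothetical, and the PDF-to-XML conversion step is assumed to have already run.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed XML, standing in for the output of a PDF converter.
SAMPLE_XML = """<document>
  <line>Invoice No: 1001</line>
  <line>Customer: Alice</line>
</document>"""

NEW_XML = """<document>
  <line>Invoice No: 2002</line>
  <line>Customer: Bob</line>
</document>"""


def learn_rules(sample_xml, mapping):
    """Process 1: derive rules from the sample XML and the user's mapping.

    mapping: semantic item name -> data value the user picked in the sample.
    Assumed rule format: the label text that precedes the value on its line.
    """
    root = ET.fromstring(sample_xml)
    rules = {}
    for item, value in mapping.items():
        for line in root.iter("line"):
            text = line.text or ""
            if value and text.endswith(value):
                rules[item] = text[: len(text) - len(value)]
                break
    return rules


def extract(doc_xml, rules, root_tag="record"):
    """Process 2: apply the rules and wrap the results as self-describing XML."""
    root = ET.fromstring(doc_xml)
    out = ET.Element(root_tag)
    for item, prefix in rules.items():
        for line in root.iter("line"):
            text = line.text or ""
            # An empty prefix carries no label to match on, so it is skipped.
            if prefix and text.startswith(prefix):
                ET.SubElement(out, item).text = text[len(prefix):].strip()
                break
    return out


# Rules are stored per domain so they can be looked up from the domain
# information the user submits with new documents.
RULES_BY_DOMAIN = {
    "invoice": learn_rules(SAMPLE_XML, {"invoice_no": "1001", "customer": "Alice"})
}

result = extract(NEW_XML, RULES_BY_DOMAIN["invoice"])
print(ET.tostring(result, encoding="unicode"))
```

Keying the rules by domain mirrors the abstract's description of selecting rules from the submitted domain information; a real system would need more robust rules than a fixed label prefix.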
Keywords/Search Tags:PDF, information extraction, XML, Semantics