Research For Semantic Information Extraction From PDF Document

Posted on:2005-01-01

Degree:Master

Type:Thesis

Country:China

Candidate:B Zhang

GTID:2168360125454755

Subject:Computer application technology

Abstract/Summary:

PDF documents are widely used: the number of PDF files in circulation is very large, the range of PDF applications keeps growing, and more and more people and institutions are adopting the format. The ubiquity and rapid development of PDF stand in striking contrast to how inefficiently PDF documents are managed, so semantic-based query and management for PDF is urgently needed.

This system combines information extraction with machine learning. Valuable data is extracted from a PDF document according to its semantics and then wrapped into XML. The system has two principal processes.

The first is forming extraction rules. The user first reads the sample document in a PDF viewer, then creates a semantic schema for it and establishes the mapping between the semantic items of the schema and the data items in the PDF. While the user is doing this, the sample PDF is converted into well-formed XML. Once both the user's work and the document conversion are complete, the system automatically produces the rules from the well-formed XML according to the mapping.

The second is information extraction using the rules, followed by information wrapping. The user submits the PDF documents and the domain information. The system preprocesses the PDF documents into well-formed XML documents, retrieves the extraction rules that match the domain information, and applies those rules to the well-formed XML documents, yielding self-describing, semi-structured XML. The system is significant for semantic-based query and management of PDF.
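The two processes described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not state the rule format, so positional element paths are assumed here, and the `page`/`line` element names, the sample data, and the functions `learn_rules` and `extract` are all hypothetical. The PDF-to-XML conversion step is skipped, and the XML is given directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed XML standing in for a converted sample PDF.
SAMPLE_XML = """<document>
  <page>
    <line>Invoice No: 1001</line>
    <line>Total: 250.00</line>
  </page>
</document>"""

# Hypothetical user mapping: semantic item of the schema -> data item in the sample.
MAPPING = {"invoice_no": "Invoice No: 1001", "total": "Total: 250.00"}

# A second document from the same (assumed) domain, already converted to XML.
NEW_XML = ("<document><page><line>Invoice No: 2002</line>"
           "<line>Total: 99.50</line></page></document>")


def learn_rules(sample_xml, mapping):
    """Process 1: derive one positional path rule per semantic item."""
    root = ET.fromstring(sample_xml)
    rules = {}
    for item, text in mapping.items():
        for p, page in enumerate(root.findall("page"), start=1):
            for n, line in enumerate(page.findall("line"), start=1):
                if (line.text or "").strip() == text:
                    # Assumed rule format: an ElementTree-compatible path.
                    rules[item] = f"./page[{p}]/line[{n}]"
    return rules


def extract(xml_text, rules, domain):
    """Process 2: apply the rules and wrap the results in self-describing XML."""
    root = ET.fromstring(xml_text)
    wrapped = ET.Element(domain)
    for item, path in rules.items():
        node = root.find(path)
        if node is not None:
            ET.SubElement(wrapped, item).text = (node.text or "").strip()
    return ET.tostring(wrapped, encoding="unicode")


if __name__ == "__main__":
    rules = learn_rules(SAMPLE_XML, MAPPING)
    print(rules)
    print(extract(NEW_XML, rules, "invoice"))
```

Purely positional rules like these only work when every document in a domain shares the sample's layout; a real system would presumably need rules that tolerate layout variation, which this sketch does not attempt.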

Keywords/Search Tags:

PDF, information extraction, XML, Semantics

Related items

1	Web Information Retrieval System Based On Classification Semantics
2	Field Information Extraction System Based On Semantics
3	Research For Semantic Information Extraction From PDF Document
4	Based The Multidimensional Semantics Internet Drug Information Extraction Research Applications
5	Efficient Plagiarism Detection Techniques And Systems On Semantics Of Academic Paper
6	Research On Dense Image Caption Algorithm Based On Depth Semantics
7	Research And Application Of Information Retrieval Method Based On Semantics
8	Research On Extraction Of Multimedia Semantics And Its Application In Video Watermarking
9	Extraction Technology And Internet Product Information Based On The Structural Semantics Of Entropy
10	Research On Technologies Of Relation Extraction Based On Frame Semantics