Research on Information Extraction Technology Based on an XML Intermediate Document

Posted on:2006-11-10

Degree:Master

Type:Thesis

Country:China

Candidate:C L Zhao

Full Text:PDF

GTID:2208360182476971

Subject:Software engineering

Abstract/Summary:

With the development of Web technology, the amount of information on the Web is expanding rapidly, and how to handle these vast information resources has attracted much attention. Progress in information extraction (IE) technology for Web resources is therefore of great importance. Traditional IE from unstructured text, however, is typically based on natural language processing (NLP) and restricted to a specific domain. With the boom of the Web, there is an urgent need for structural IE systems that extract from (semi-)structured documents. Yet HTML, the basic foundation of the Web, restrains the further exploitation and utilization of information resources because of its own limitations. In addition, documents in many other formats are encountered on the Web and in daily work, and the way these documents are organized and represented differs greatly depending on their background. Transforming documents between different document systems is therefore a necessary approach to content sharing and cooperation. After surveying this situation, this thesis analyzes the advantages of using XML for information extraction and proposes an XML-based intermediate document format, which mainly covers the title, structure, text formatting information, links, tables, and some metadata of documents. The method for converting common document formats, such as PDF and Word, into the XML intermediate document format is described in detail.
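
To make the idea of an XML intermediate document concrete, the following is a minimal sketch in Python using the standard library. The abstract does not give the actual schema, so the element and attribute names (`document`, `metadata`, `entry`, `title`, `body`, `paragraph`, `fontSize`, `bold`, `links`, `link`) and the function `build_intermediate` are assumptions made for illustration only; tables are omitted for brevity.

```python
import xml.etree.ElementTree as ET


def build_intermediate(title, paragraphs, links=(), metadata=None):
    """Build a hypothetical XML intermediate document.

    paragraphs: iterable of (text, font_size, bold) tuples.
    All element and attribute names are illustrative assumptions,
    not the schema defined in the thesis.
    """
    root = ET.Element("document")

    # Document-level metadata, e.g. the source format.
    meta = ET.SubElement(root, "metadata")
    for key, value in (metadata or {}).items():
        ET.SubElement(meta, "entry", name=key).text = value

    ET.SubElement(root, "title").text = title

    # Body text, keeping the formatting information of each paragraph.
    body = ET.SubElement(root, "body")
    for text, font_size, bold in paragraphs:
        para = ET.SubElement(
            body, "paragraph",
            fontSize=str(font_size),
            bold=str(bold).lower(),
        )
        para.text = text

    # Links found in the source document.
    link_list = ET.SubElement(root, "links")
    for href in links:
        ET.SubElement(link_list, "link", href=href)

    return root


doc = build_intermediate(
    "Sample Title",
    [("Sample Title", 18, True), ("First paragraph of body text.", 11, False)],
    links=["http://example.com"],
    metadata={"source": "pdf"},
)
xml_text = ET.tostring(doc, encoding="unicode")
```

A converter for PDF or Word would fill these arguments from its own parser; the point of the sketch is only that every source format ends up in one common tree that later extraction steps can query.
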
Several document content extraction tasks have been accomplished on the basis of the XML intermediate document. The main features of the system are as follows: (1) it analyzes the content and structure of several common document formats; (2) it defines a general document format description language and uses these format descriptions to identify and analyze a variety of documents; (3) it extracts document titles based on the intermediate document format; (4) it extracts the title, abstract, keywords, and other information from electronic journal papers based on specific templates.
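
Title extraction from an intermediate document can be sketched as follows. The abstract does not state which rule the system uses, so this example assumes a simple heuristic (pick the non-empty paragraph with the largest font size among the first few paragraphs) and a hypothetical layout with `body`/`paragraph` elements carrying a `fontSize` attribute. The names `SAMPLE` and `extract_title` are illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical intermediate document; the layout is an assumption.
SAMPLE = """<document>
  <body>
    <paragraph fontSize="10">Journal of Examples, Vol. 1</paragraph>
    <paragraph fontSize="18" bold="true">A Sample Paper Title</paragraph>
    <paragraph fontSize="11">A. Author</paragraph>
    <paragraph fontSize="10">Body text starts here.</paragraph>
  </body>
</document>"""


def extract_title(root, max_candidates=10):
    """Return the text of the largest-font paragraph near the top.

    This is an assumed heuristic, not the method from the thesis.
    Returns None when no non-empty paragraph is found.
    """
    candidates = root.findall("./body/paragraph")[:max_candidates]
    candidates = [p for p in candidates if (p.text or "").strip()]
    if not candidates:
        return None
    # On ties, max() keeps the earliest paragraph in document order.
    best = max(candidates, key=lambda p: float(p.get("fontSize", "0")))
    return best.text.strip()


title = extract_title(ET.fromstring(SAMPLE))
```

Because the heuristic reads only the intermediate XML, the same function would apply to documents converted from PDF, Word, or any other supported format.
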

Keywords/Search Tags:

XML, information extraction, PDF, WORD, Document

Related items

1	Research And Implementation About Assisted Writing System Of Traffic Information Standards
2	Information Extraction System For Three Types Of Information Disclosure Announcements Of Listed Companies
3	Research And Implement Of Information Hiding System Based On Word Document
4	Study On Information Hiding Techniques Based On Word Text Document
5	Visual Web Page Information Extraction And Text Feature Word Extraction Technology Research
6	Research And Implementation Of Short Text Topic Extraction Based On Document-Word Co-Occurrence Graph
7	Research On Large-Scale Chinese People Information Extraction Based On Web
8	Research On Multimodal Algorithm For Structured Document Information Extraction
9	Research On Semantic Based Document Keyword Extraction Technology
10	A Chinese Word Level Segmentation Algorithm Based On Document Category