Research on Information Extraction Technology Based on an XML Intermediate Document

Posted on: 2006-11-10
Degree: Master
Type: Thesis
Country: China
Candidate: C L Zhao
Full Text: PDF
GTID: 2208360182476971
Subject: Software engineering
Abstract/Summary:
With the development of Web technology, the volume of information on the Web is growing rapidly, and how to handle these vast information resources has attracted much attention. Progress in information extraction (IE) from Web resources is therefore of great importance. Traditional IE from unstructured text, however, typically relies on natural language processing (NLP) and is restricted to a specific domain. With the boom of the Web, there is an urgent need for structural IE systems that extract information from (semi-)structured documents. Yet HTML, the basic foundation of the Web, restricts further exploitation and use of these information resources because of its own limitations. In addition, a great many documents in other formats are encountered on the Web and in daily work, and because they come from different backgrounds, they differ greatly in how they are organized and represented. Document transformation among different document systems is thus a necessary approach to content sharing and cooperation. After summarizing this situation, this thesis analyzes the advantages of using XML for information extraction and proposes an XML-based intermediate document format, which mainly includes a document's title, structure, text formatting information, links, tables, and some metadata. It then describes in detail how common document formats, such as PDF and Word, are transformed into the XML intermediate document format.
We have accomplished several document content extraction tasks based on the XML intermediate document. The main features of the system are as follows:
(1) Analysis of the content and structure of documents in several common formats.
(2) Definition of a general document format description language, with which a variety of documents are identified and analyzed based on descriptions of their formats.
(3) Extraction of document titles based on the intermediate document format.
(4) Extraction of the title, abstract, keywords, and other information from electronic journal papers based on specific templates.
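
To make the idea of title extraction from an XML intermediate document more concrete, the following is a minimal sketch in Python. The abstract does not give the actual schema or the extraction rules, so everything specific here is an assumption made for illustration: the element names (`document`, `metadata`, `body`, `block`, `link`, `table`), the attributes (`page`, `font-size`, `bold`), the sample text, and the heuristic itself (take the first-page block with the largest font size). It should not be read as the method used in the thesis.

```python
import xml.etree.ElementTree as ET

# Hypothetical intermediate document. Element and attribute names are
# illustrative assumptions, not the schema defined in the thesis.
SAMPLE = """\
<document source="paper.pdf">
  <metadata>
    <format>PDF</format>
  </metadata>
  <body>
    <block page="1" font-size="18" bold="true">A Study of Document Conversion</block>
    <block page="1" font-size="10">Abstract text goes here.</block>
    <block page="2" font-size="14" bold="true">1 Introduction</block>
    <link href="http://example.com/">example</link>
    <table rows="2" cols="2"/>
  </body>
</document>
"""


def extract_title(xml_text):
    """Return the text of the largest-font block on page 1, or None.

    This is an assumed heuristic for illustration only.
    """
    root = ET.fromstring(xml_text)
    # Keep only text blocks that the converter placed on the first page.
    first_page = [b for b in root.iter("block") if b.get("page") == "1"]
    if not first_page:
        return None
    # Pick the block with the largest font size; ties go to the earliest block.
    best = max(first_page, key=lambda b: float(b.get("font-size", "0")))
    return (best.text or "").strip()


if __name__ == "__main__":
    print(extract_title(SAMPLE))
```

Because the heuristic reads only the intermediate XML, the same function would apply whether the source was PDF or Word, which is the kind of format independence the intermediate document is meant to provide.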
Keywords/Search Tags: XML, information extraction, PDF, Word, document