Research Of Text Structure Information Extraction Methods

Posted on:2011-04-08

Degree:Master

Type:Thesis

Country:China

Candidate:S S Zou

Full Text:PDF

GTID:2178330332961462

Subject:Computer application technology

Abstract/Summary:

The information explosion has produced an enormous amount of text data, and the problem of how to obtain useful information from it quickly and accurately and transform it into structured forms that computers can process has attracted wide attention. At the same time, XML is becoming the new standard for representing and exchanging Web data. Extracting structure information from unstructured text files and converting the files into XML documents that conform to specified DTD files therefore has great practical value: it can improve the efficiency of text information retrieval, bring the advantages of XML into play, and meet the requirements of next-generation information retrieval.
Information Extraction (IE) is an effective way to convert unstructured or semi-structured information into structured formats. However, almost all previously proposed IE algorithms are devoted to recognizing parts of a text and annotating them with semantic tags. Their results present only the correspondence between semantic tags and text content and do not provide sufficient structure information for generating XML documents. To extract enough structure information from text files to construct XML documents, this paper proposes a novel text structure information extraction method based on Hidden Markov Models (HMMs), which uses the path information in XML documents for HMM training. Multiple HMMs can be trained on XML documents described by multiple XML DTD files, using a universal emission probability matrix. The method can thus automatically obtain the structure of a text and generate an XML document that conforms to the corresponding DTD.
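The idea of using XML paths as HMM states can be illustrated with a minimal sketch. It is not the thesis's implementation: it trains a single HMM by simple counting with add-one smoothing rather than multiple HMMs sharing a universal emission matrix, and the training tokens, the paths such as `/paper/title`, and the function names `train_hmm`, `viterbi` and `to_xml` are all hypothetical. Paths are assumed to be two levels deep.

```python
import math
from collections import Counter, defaultdict
from xml.etree import ElementTree as ET

# Hypothetical training data: each token is paired with the XML path of the
# element it came from. Tokens and paths are invented for illustration.
TRAIN = [
    [("hidden", "/paper/title"), ("markov", "/paper/title"), ("models", "/paper/title"),
     ("zou", "/paper/author"), ("we", "/paper/abstract"), ("propose", "/paper/abstract")],
    [("random", "/paper/title"), ("fields", "/paper/title"),
     ("li", "/paper/author"), ("we", "/paper/abstract"), ("extract", "/paper/abstract")],
]

def train_hmm(sequences, smoothing=1.0):
    """Estimate start, transition and emission log-probabilities by counting."""
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    vocab = set()
    for seq in sequences:
        prev = None
        for token, path in seq:
            vocab.add(token)
            emit[path][token] += 1
            if prev is None:
                start[path] += 1
            else:
                trans[prev][path] += 1
            prev = path
    states = sorted(emit)
    vocab_size = len(vocab) + 1  # one extra slot for unseen tokens

    def logp(counter, key, size):
        total = sum(counter.values())
        return math.log((counter[key] + smoothing) / (total + smoothing * size))

    return {
        "states": states,
        "start": lambda s: logp(start, s, len(states)),
        "trans": lambda a, b: logp(trans[a], b, len(states)),
        "emit": lambda s, tok: logp(emit[s], tok, vocab_size),
    }

def viterbi(model, tokens):
    """Return the most probable sequence of XML paths for the tokens."""
    states = model["states"]
    score = {s: model["start"](s) + model["emit"](s, tokens[0]) for s in states}
    back = []
    for tok in tokens[1:]:
        new_score, pointers = {}, {}
        for s in states:
            best = max(states, key=lambda p: score[p] + model["trans"](p, s))
            new_score[s] = score[best] + model["trans"](best, s) + model["emit"](s, tok)
            pointers[s] = best
        score = new_score
        back.append(pointers)
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

def to_xml(tokens, paths):
    """Group consecutive tokens that share a path into one child element."""
    root = ET.Element(paths[0].strip("/").split("/")[0])
    prev, node = None, None
    for tok, path in zip(tokens, paths):
        if path != prev:
            # Assumes two-level paths: /root/child
            node = ET.SubElement(root, path.strip("/").split("/")[1])
            node.text = tok
        else:
            node.text += " " + tok
        prev = path
    return ET.tostring(root, encoding="unicode")

model = train_hmm(TRAIN)
tokens = ["markov", "fields", "li", "we", "propose"]
paths = viterbi(model, tokens)
xml_out = to_xml(tokens, paths)
print(xml_out)
```

Because each hidden state is a full element path rather than a flat semantic tag, the decoded label sequence already carries the nesting needed to rebuild the document tree, which is the property the abstract emphasizes.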
Given the strong independence assumptions and the single-feature limitation of HMMs, this paper also proposes an approach that uses Conditional Random Fields (CRFs) to extract higher-order structures from unstructured texts. CRFs provide a powerful and flexible mechanism for exploiting arbitrary feature sets, model the conditional probability directly, and can alleviate the sparsity of training data to a certain degree. Finally, because the proposed methods integrate text into XML, their application to data integration is discussed and an XML-based data integration architecture is given.
Experiments on a real-life dataset show that the HMM-based method achieves good results. The CRF-based method, although its feature set is difficult to build, achieves higher precision and recall than the previous method. The methods proposed in this paper can serve both as a solution to the text processing problem in XML information retrieval systems and as a reference for research on data integration.
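For reference, the phrase "models the conditional probability directly" can be made concrete with the standard linear-chain CRF. This is the textbook form, given here as an assumption; the higher-order model used in the thesis may differ. The symbols are not from the source: x is the token sequence, y the label sequence (for example, XML paths), f_k are feature functions and λ_k their learned weights.

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'}
    \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)
```

Each f_k may inspect the whole input x, which is why a CRF can combine arbitrary, overlapping features, whereas an HMM ties each observation to a single emission probability.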

Keywords/Search Tags:

Information Extraction, HMM, Conditional Random Field, XML

Related items

1	Conditional Random Field Based Object Extraction
2	Research On Personnel Resume Intelligent Extraction System Based On Conditional Random Fields
3	Research For Event Extraction Method In Specific Domain Based On Tree Conditional Random Field
4	Research Of Text Structure Information Extraction Methods
5	Research And Realization Of Web Information Extraction For Specific Field
6	Research On Image Understanding Algorithm By Embedding Prior Information In Conditional Random Field Framework
7	Research Of Entity Knowledge Base System Based On Information Extraction
8	The Research And Application Of Conditional Random Field
9	Product Information Words Recognition Based On Conditional Random Field In Electronic Commerce
10	Event Extraction:Algorithms And Applications