Research Of Text Structure Information Extraction Methods

Posted on: 2011-04-08
Degree: Master
Type: Thesis
Country: China
Candidate: S S Zou
GTID: 2178330332961462
Subject: Computer application technology
Abstract/Summary:
The information explosion has produced a tremendous amount of text data, and the problem of how to obtain useful information from it quickly and accurately, and to transform that information into structured forms that facilitate computer processing, has attracted wide attention. At the same time, XML, with its specific advantages, is becoming the new standard for Web data representation and exchange. Consequently, extracting structure information from unstructured text files and converting it into XML documents that conform to specified DTD files has great practical value: it can improve the efficiency of text information retrieval, exploit the advantages of XML, and meet the requirements of next-generation information retrieval.
Information Extraction (IE) is an effective way to convert unstructured or semi-structured information into structured formats. However, previously proposed IE algorithms are almost all devoted to recognizing and annotating parts of a text with semantic tags. Their extraction results present only the correspondence between semantic tags and text content and do not provide sufficient structure information for generating XML documents. To extract enough structure information from text files to construct XML documents, this paper proposes a novel text structure information extraction method based on Hidden Markov Models (HMMs), which uses the path information in XML documents for HMM training. Multiple HMMs can be trained from XML documents described by multiple XML DTD files, using a universal emission probability matrix. The method can thus automatically obtain the structure of a text and generate an XML document conforming to the corresponding DTD.
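The idea of decoding a token sequence into XML element paths can be illustrated with a minimal sketch. It assumes one HMM state per element path, which the abstract does not spell out; the paths under `/ref`, the probability tables, and the coarse token classes are all invented for illustration and are not taken from the thesis.

```python
import math
from xml.sax.saxutils import escape

# Hypothetical states: one per element path of an assumed DTD.
STATES = ["/ref/author", "/ref/title", "/ref/year"]

# Toy probabilities, invented for illustration only.
START = {"/ref/author": 0.8, "/ref/title": 0.1, "/ref/year": 0.1}
TRANS = {
    "/ref/author": {"/ref/author": 0.6, "/ref/title": 0.3, "/ref/year": 0.1},
    "/ref/title": {"/ref/author": 0.05, "/ref/title": 0.7, "/ref/year": 0.25},
    "/ref/year": {"/ref/author": 0.1, "/ref/title": 0.1, "/ref/year": 0.8},
}
EMIT = {
    "/ref/author": {"cap": 0.8, "word": 0.15, "year": 0.05},
    "/ref/title": {"cap": 0.3, "word": 0.65, "year": 0.05},
    "/ref/year": {"cap": 0.05, "word": 0.05, "year": 0.9},
}


def token_class(token):
    """Map a token to a coarse class so the emission table stays small."""
    if token.isdigit() and len(token) == 4:
        return "year"
    if token[:1].isupper():
        return "cap"
    return "word"


def viterbi(tokens):
    """Return the most probable sequence of element paths for the tokens."""
    first = token_class(tokens[0])
    score = [{s: math.log(START[s]) + math.log(EMIT[s][first]) for s in STATES}]
    back = []
    for tok in tokens[1:]:
        cls, prev = token_class(tok), score[-1]
        cur, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            cur[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][cls])
            ptr[s] = best
        score.append(cur)
        back.append(ptr)
    # Follow the back-pointers from the best final state.
    path = [max(STATES, key=lambda s: score[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


def to_xml(tokens, paths):
    """Group consecutive tokens that share a path into one XML element."""
    root = paths[0].split("/")[1]
    parts, i = [f"<{root}>"], 0
    while i < len(tokens):
        j = i
        while j < len(tokens) and paths[j] == paths[i]:
            j += 1
        tag = paths[i].split("/")[-1]
        parts.append(f"<{tag}>{escape(' '.join(tokens[i:j]))}</{tag}>")
        i = j
    parts.append(f"</{root}>")
    return "".join(parts)


if __name__ == "__main__":
    tokens = "Zou Smith hidden markov models 2011".split()
    print(to_xml(tokens, viterbi(tokens)))
```

Because each state already names a position in the document tree, the decoded state sequence carries the nesting needed to emit XML, which flat semantic tags alone would not provide.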
Given the strong independence assumptions and the single-feature limitation of HMMs, this paper also proposes an approach that uses Conditional Random Fields (CRFs) to extract higher-order structures from unstructured texts. CRFs provide a powerful and flexible mechanism for exploiting arbitrary feature sets, model the conditional probability directly, and can alleviate the sparsity problem of training data to a certain degree. Finally, because the proposed methods integrate text into XML, their application in the data integration domain is discussed and an XML-based data integration architecture is given.
Experiments on a real-life dataset show that the proposed HMM-based method achieves good results. The CRF-based method, although its feature set is difficult to build, achieves higher precision and recall than the previous method. The methods in this paper can serve both as a suitable choice for solving the text processing problem of XML information retrieval systems and as a useful reference for research on data integration.
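How a linear-chain CRF combines arbitrary, overlapping features and models the conditional probability of a label sequence directly can be sketched as follows. The labels, feature names, and weights are hypothetical and chosen by hand. The normalizer is computed by brute-force enumeration, which is only feasible for tiny inputs; practical implementations use the forward algorithm instead.

```python
import itertools
import math

# Hypothetical label set and hand-set weights, for illustration only.
LABELS = ["author", "title", "year"]
WEIGHTS = {
    "cap&author": 1.5,
    "first&author": 1.0,
    "4digit&year": 3.0,
    "label=title": 0.2,
    "trans=author>title": 0.5,
    "trans=title>year": 0.5,
}


def features(prev, cur, tokens, i):
    """Active features at position i; several may describe the same token."""
    tok = tokens[i]
    feats = [f"label={cur}", f"trans={prev}>{cur}"]
    if tok[:1].isupper():
        feats.append(f"cap&{cur}")
    if tok.isdigit() and len(tok) == 4:
        feats.append(f"4digit&{cur}")
    if i == 0:
        feats.append(f"first&{cur}")
    return feats


def score(labels, tokens):
    """Unnormalized log-score: the sum of weights of all active features."""
    total, prev = 0.0, "START"
    for i, cur in enumerate(labels):
        total += sum(WEIGHTS.get(f, 0.0) for f in features(prev, cur, tokens, i))
        prev = cur
    return total


def conditional(labels, tokens):
    """p(labels | tokens), normalized over every possible labeling."""
    z = sum(
        math.exp(score(cand, tokens))
        for cand in itertools.product(LABELS, repeat=len(tokens))
    )
    return math.exp(score(labels, tokens)) / z


if __name__ == "__main__":
    tokens = ["Zou", "hidden", "2011"]
    best = max(
        itertools.product(LABELS, repeat=len(tokens)),
        key=lambda cand: score(cand, tokens),
    )
    print(best, conditional(best, tokens))
```

Unlike an HMM emission table, the feature function may inspect any property of the input at any position, which is the flexibility the abstract attributes to CRFs.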
Keywords/Search Tags: Information Extraction, HMM, Conditional Random Field, XML