Research Of Information Extraction From Web Documents Based On Tree Automaton

Posted on: 2014-10-07
Degree: Master
Type: Thesis
Country: China
Candidate: F Yang
GTID: 2268330422952303
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, the Web has become an important source from which to extract information, and how to access information accurately and quickly across such a huge network has become a research direction for a growing number of scientists. Web information extraction technology came into being in this context. It can obtain the information a user needs from the Internet, and the extracted information can also serve as the basis for building data mining systems and intelligent query systems. Information extraction technology therefore has broad application prospects.

Information extraction (IE) is a means of extracting specific information from a collection of documents. Most information extraction work targets semi-structured documents in XML or HTML, and existing techniques are based on string methods, such as the induction of finite automata. However, this approach does not take advantage of the tree structure of an XML document. In this thesis, we use tree automata in place of string-based extraction methods.

The thesis first introduces Web information extraction, its classification and its evaluation, and then analyzes tree automata, grammatical inference and information extraction technology. For the inference of ranked tree automata, it proposes g-testable and gl-testable inference algorithms, built on the k-testable algorithm, to improve the recall and precision of extraction, and on this basis designs a prototype Web information extraction system based on ranked tree automata. Finally, experiments on benchmark data sets and on large data sets show that this method performs much better than string-based information extraction methods.
Keywords/Search Tags: tree automata, information extraction, XML, grammatical inference
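
To make the contrast with string-based methods concrete, the following is a minimal sketch of a deterministic bottom-up automaton over ranked trees. It is not the thesis's g-testable or gl-testable inference algorithm; it only illustrates how a tree automaton assigns states from the leaves upward. The labels (`cons`, `nil`, `item`, `name`, `price`), the state names and the transition table `DELTA` are all hypothetical and chosen purely for illustration.

```python
from typing import Dict, Optional, Set, Tuple

# A ranked tree is a tuple (label, child_1, ..., child_n); a leaf is (label,).
Tree = tuple
# A transition maps (label, states of the children) to the state of the node.
Transitions = Dict[Tuple[str, Tuple[str, ...]], str]


def run(tree: Tree, delta: Transitions) -> Optional[str]:
    """Return the state reached at the root, or None if no rule applies."""
    label, *children = tree
    child_states = []
    for child in children:
        state = run(child, delta)
        if state is None:  # a subtree was rejected, so the whole tree is
            return None
        child_states.append(state)
    return delta.get((label, tuple(child_states)))


def accepts(tree: Tree, delta: Transitions, final_states: Set[str]) -> bool:
    """A tree is accepted when its root reaches a final state."""
    return run(tree, delta) in final_states


# Hypothetical rules: a list is either nil or cons(item, list),
# and an item must contain a name followed by a price.
DELTA: Transitions = {
    ("nil", ()): "q_list",
    ("name", ()): "q_name",
    ("price", ()): "q_price",
    ("item", ("q_name", "q_price")): "q_item",
    ("cons", ("q_item", "q_list")): "q_list",
}
FINAL = {"q_list"}

good = ("cons", ("item", ("name",), ("price",)), ("nil",))
bad = ("cons", ("item", ("price",), ("name",)), ("nil",))

if __name__ == "__main__":
    print(accepts(good, DELTA, FINAL))
    print(accepts(bad, DELTA, FINAL))
```

Because each state is computed from the states of a node's children, the automaton sees the nesting of elements directly, which is the structural information a string automaton reading the flattened markup does not use.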