Research On Tree Automata-Based Web Information Extraction

Posted on:2010-06-29

Degree:Master

Type:Thesis

Country:China

Candidate:P X Tan

Full Text:PDF

GTID:2178360278980739

Subject:Computer application technology

Abstract/Summary:

With the rapid development of the Internet, the Web has become an important source of information, and more and more researchers are studying how to obtain information from it quickly and accurately. Web information extraction emerged in this context. It not only retrieves the information people need from the Internet, but also supplies the extracted information as a basis for intelligent query systems and data mining systems, so it has broad prospects for development.

In recent years, Web information extraction technology has developed considerably, but a series of problems remain, such as the complexity of extraction rules and a low degree of automation. To address these shortcomings, this thesis presents how to achieve information extraction using tree automata technology, in particular unranked tree automata technology. Its main contributions are as follows.

First, existing unranked tree automata inference algorithms are inefficient and generate automata that are too large to be applied to information extraction. Building on the (k,l)-contextual tree language, the thesis presents a KLH tree language for use in information extraction and, based on it, proposes a KLH unranked tree automata inference algorithm. The automata generated by the KLH algorithm are smaller in scale and highly efficient.

Second, because existing Web pages usually contain a large amount of information irrelevant to their subject, the thesis uses a DOM tree-based noise filtering algorithm built on a structural analysis of Web pages. The algorithm introduces the concept of a noise coefficient, which the tree matching algorithm uses to determine whether a result is noise; the irrelevant information is then removed to improve efficiency.

Third, the thesis designs a prototype information extraction system based on unranked tree automata technology. Working on the DOM tree, the system uses the noise filtering algorithm to reduce the size of the Web documents, and uses the KLH unranked tree automata inference algorithm to obtain an unranked tree automaton from the DOM tree. This automaton serves as the extraction rule, and data are extracted according to its accepting and rejecting states. Experiments with the prototype system show that it extracts efficiently while reaching very high levels of accuracy and recall.
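
To make the idea of using an unranked tree automaton as an extraction rule more concrete, the following is a minimal Python sketch of a bottom-up automaton run over a DOM-like tree. It is only an illustration under stated assumptions and is not the thesis's KLH inference algorithm, whose details the abstract does not give. The node labels, the state names (qName, qPrice, qItem, qList) and the hand-written RULES table are all hypothetical; in the system described above, such rules would be inferred from example pages rather than written by hand.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """A DOM-like node with a label and any number of children (unranked)."""
    label: str
    children: list = field(default_factory=list)


# Hypothetical transition rules: (node label, regex over child states, state).
# The child states are encoded as a string in which every state name is
# followed by one space, so the regex describes the allowed child sequence.
RULES = [
    ("name", r"", "qName"),               # leaf: no children allowed
    ("price", r"", "qPrice"),             # leaf: no children allowed
    ("item", r"qName qPrice ", "qItem"),  # a name followed by a price
    ("list", r"(qItem )+", "qList"),      # one or more items
]
FINAL = {"qList"}  # hypothetical set of accepting states


def run(node):
    """Return the state assigned to node, or None if no rule applies."""
    child_states = []
    for child in node.children:
        state = run(child)
        if state is None:
            return None  # a rejected subtree rejects the whole tree
        child_states.append(state)
    word = "".join(s + " " for s in child_states)
    for label, pattern, state in RULES:
        if label == node.label and re.fullmatch(pattern, word):
            return state
    return None


def accepts(tree):
    """A tree is accepted when its root reaches an accepting state."""
    return run(tree) in FINAL


good = Node("list", [
    Node("item", [Node("name"), Node("price")]),
    Node("item", [Node("name"), Node("price")]),
])
bad = Node("list", [Node("item", [Node("price"), Node("name")])])
print(accepts(good), accepts(bad))
```

In this sketch the first tree is accepted and the second is rejected because its children appear in the wrong order. Describing the allowed child sequences with regular expressions is what lets a single rule cover nodes with any number of children, which is the defining feature of unranked tree automata.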

Keywords/Search Tags:

Web information extraction, tree automata technology, unranked tree automata technology, KLH tree language, KLH unranked tree automata inference algorithm

Related items

1	Research On Construction And Minimization Algorithm Of Fuzzy Tree Automata
2	Algebraic Properties And Tree Language Of Finite Fuzzy Tree Automata
3	Based On The Number Of New Models Of Timed Automata
4	The Algebraic Properties Of Quantum Finite Tree Automata
5	Research Of Information Extraction From Web Documents Based On Tree Automaton
6	Research On State Complexities And Models Of Automata
7	The Properties And Regular Expressions Of Two Types Of Fuzzy Finite Tree Automata
8	Research And Application Of Classification Algorithm Based On Cellular Automata
9	Boolean Algebra And Automata Theory Over Boolean Algebra
10	Weighted tree automata and transducers for syntactic natural language processing