Research On Tree Automata-Based Web Information Extraction

Posted on: 2010-06-29
Degree: Master
Type: Thesis
Country: China
Candidate: P X Tan
Full Text: PDF
GTID: 2178360278980739
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, the Web has become an important source of information, and more and more researchers are studying how to obtain information from it quickly and accurately. Web information extraction arose in this context. It not only retrieves the information people need from the Internet, but also supplies the extracted information as a basis for intelligent query systems and data mining systems, so it has broad prospects for development.

In recent years, Web information extraction technology has developed considerably, but a series of problems remain, such as complex extraction rules and a low degree of automation. To address these shortcomings, this thesis presents an approach to information extraction based on tree automata, in particular unranked tree automata. Its main contributions are as follows.

Existing unranked tree automata inference algorithms are inefficient, and the automata they generate are too large to be practical for information extraction. Building on (k,l)-contextual tree languages, this thesis defines the KLH tree language for use in information extraction and, based on it, proposes a KLH unranked tree automata inference algorithm. The automata generated by the KLH algorithm are smaller and more efficient.

Because Web pages usually contain a large amount of information irrelevant to their subject, this thesis applies a DOM-tree-based noise filtering algorithm that builds on a structural analysis of Web pages. The algorithm introduces a noise coefficient, which the tree matching algorithm uses to determine whether a matched result is noise; the irrelevant information is then removed to improve efficiency.

Finally, we designed a prototype information extraction system based on unranked tree automata. Working on the DOM tree, the system uses the noise filtering algorithm to reduce the size of the Web documents and uses the KLH unranked tree automata inference algorithm to obtain an unranked tree automaton from the DOM tree. The inferred automaton serves as the extraction rule, and data are extracted according to whether the automaton accepts or rejects. Experiments with the prototype show that the system extracts data efficiently while accuracy and recall both reach very high levels.
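
To make the idea of using an unranked tree automaton as an extraction rule more concrete, the following is a minimal sketch of a deterministic bottom-up run over a DOM-like tree. It is not the KLH inference algorithm and not the thesis's implementation: the `Node` class, the hand-written `RULES` table, the state names, and the encoding of child states as space-terminated tokens matched by a regular expression are all assumptions made for illustration. In the thesis the automaton is inferred from example pages rather than written by hand.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """A DOM-like node: a tag label and any number of children (unranked)."""
    label: str
    children: list["Node"] = field(default_factory=list)


# Hypothetical transition rules: (node label, regex over child states, new state).
# Child states are written as "state " tokens, so r"(cell )+" means
# "one or more children, all in state 'cell'".
RULES = [
    ("td", r"", "cell"),            # a leaf <td> becomes a cell
    ("tr", r"(cell )+", "row"),     # a <tr> of one or more cells becomes a row
    ("table", r"(row )+", "data"),  # a <table> of one or more rows is data
]
ACCEPTING = {"data"}


def run(node, rules):
    """Return the state assigned to `node`, or None if no rule applies."""
    child_states = [run(child, rules) for child in node.children]
    if any(state is None for state in child_states):
        return None  # a subtree was rejected, so this node is rejected too
    word = "".join(state + " " for state in child_states)
    for label, horizontal, state in rules:
        if label == node.label and re.fullmatch(horizontal, word):
            return state
    return None


def accepts(tree, rules=RULES, accepting=ACCEPTING):
    """A tree is accepted when its root reaches an accepting state."""
    return run(tree, rules) in accepting
```

Under these assumptions, `accepts(Node("table", [Node("tr", [Node("td"), Node("td")])]))` holds, while a table whose row has no cells is rejected. An extractor built along these lines would keep the subtrees the automaton accepts and discard the rest.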
Keywords/Search Tags: Web information extraction, tree automata technology, unranked tree automata technology, KLH tree language, KLH unranked tree automata inference algorithm