
Focused Crawling Based On Relational Subgroup Discovery

Posted on: 2009-10-18
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Q Y Xu
Full Text: PDF
GTID: 1118360245963118
Subject: Computer application technology

Abstract/Summary:
While crawling the World Wide Web, a focused web crawler aims to collect as many web pages relevant to one or more predefined topics, and as few irrelevant ones, as possible. The fundamental technical difficulty of focused crawling lies in the need to predict a web page's topical relevance before downloading it. This decision relies exclusively on diverse, indirect, and subtle relevance clues, which are ubiquitous but noisy and extremely difficult for traditional machine learning approaches to exploit. Such relevance clues are usually referred to as "link context". This paper puts forward an approach that extracts link context precisely via parsing techniques from Natural Language Processing; the approach shows promising preliminary experimental results on the WebKB dataset.

Although precise extraction of link context helps improve hyperlink classification, in many cases the link context remains too limited: it is either too noisy or too sparse, and hyperlink classification consequently suffers from such a problematic information source. To tackle this problem, this paper proposes a novel focused crawling algorithm based on "relational subgroup discovery". The contribution is two-fold. First, it adopts first-order predicates to represent the diverse background of a hyperlink, thus avoiding the technical challenge of extracting link context. Second, it induces first-order focused crawling rules with the "subgroup discovery" technique. The reason for adopting subgroup discovery rather than traditional ILP classification is as follows: unvisited hyperlinks accumulate rapidly in the crawl frontier, so as long as we can locate a fraction of the relevant hyperlinks, the crawler can keep itself busy, because a downloaded relevant web page often leads to many more relevant hyperlinks, thanks to the so-called "topic locality" phenomenon on the Web.
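The rule-guided prediction described above can be sketched as follows. This is a minimal, hypothetical illustration, not the dissertation's implementation: each hyperlink's background is flattened to a set of ground facts (first-order predicates rendered as strings), each learned subgroup rule is a conjunction of such facts, and a hyperlink is predicted relevant if any rule fires. The predicate and rule names are invented for illustration only.

```python
# Hypothetical sketch: subgroup rules as conjunctions of ground facts.
# A rule fires when all of its conditions hold among a link's facts.

def matches(rule, facts):
    """A rule (set of conditions) fires if it is a subset of the link's facts."""
    return rule <= facts

def predict_relevant(link_facts, rules):
    """A hyperlink is predicted relevant if ANY subgroup rule fires."""
    return any(matches(rule, link_facts) for rule in rules)

# Illustrative rules and hyperlink backgrounds (names are assumptions).
rules = [
    {"anchor_contains(link, 'machine')", "in_list_context(link)"},
    {"url_contains(link, '/papers/')"},
]

frontier = {
    "http://example.org/papers/42": {"url_contains(link, '/papers/')",
                                     "anchor_contains(link, 'pdf')"},
    "http://example.org/about":     {"anchor_contains(link, 'contact')"},
}

# Keep only frontier links that some learned rule predicts relevant.
to_crawl = [url for url, facts in frontier.items()
            if predict_relevant(facts, rules)]
print(to_crawl)  # only the /papers/ link satisfies a rule
```

The any-rule-fires semantics also mirrors why subgroup discovery suffices here: the crawler does not need to classify every hyperlink correctly, only to locate enough relevant ones to keep the frontier productive.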
Furthermore, several diverse hyperlinks often point to the same web page, and the page is classified as relevant if only one of them satisfies one of the subgroup rules. We used the DMOZ dataset as our testbed and conducted experiments on a wide range of DMOZ topics, concluding that our approach is feasible. As soon as sufficient labeled instances with their background information have accumulated, our approach can induce a good number of subgroup rules with large support and confidence. The subsequent crawling process is then guided by these learned rules; consequently, the number of downloaded irrelevant web pages tends to diminish rapidly, while the crawl is sustained by such rules. We conducted a series of comparative experiments against other established alternatives. The experimental results show that our approach has an edge over the others in terms of "harvest ratio".
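The "harvest ratio" used for comparison is, in the usual focused-crawling sense, the fraction of downloaded pages that turn out to be relevant; a sketch (the counts are made up for illustration):

```python
# Harvest ratio: relevant pages downloaded / total pages downloaded.
# Higher is better; a well-guided focused crawler wastes fewer downloads.

def harvest_ratio(relevant_downloaded, total_downloaded):
    if total_downloaded == 0:
        return 0.0  # no downloads yet, define the ratio as zero
    return relevant_downloaded / total_downloaded

# Illustrative numbers only: 380 relevant pages out of 1000 downloaded.
print(harvest_ratio(380, 1000))  # 0.38
```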
Keywords/Search Tags: Relational