
Focused Crawling Based On Relational Subgroup Discovery

Posted on: 2009-10-18
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Q Y Xu
Full Text: PDF
GTID: 1118360245963118
Subject: Computer application technology

Abstract/Summary:
While crawling the World Wide Web, a focused web crawler aims to collect as many web pages relevant to one or more predefined topics, and as few irrelevant ones, as possible. The fundamental technical difficulty of focused crawling lies in the need to predict a web page's topical relevance before downloading it. This decision relies exclusively on diverse, indirect, and subtle relevance clues, which are ubiquitous but noisy and extremely difficult for traditional machine learning approaches to exploit. Such relevance clues are usually referred to as "link context". This paper puts forward an approach that extracts link context precisely via parsing techniques from Natural Language Processing; the approach shows promising preliminary experimental results on the WebKB dataset.

Although precise extraction of link context helps improve hyperlink classification, in many cases the link context remains too limited: it is either too noisy or too sparse, and hyperlink classification consequently suffers from such a problematic information source. To tackle this problem, this paper proposes a novel focused crawling algorithm based on "relational subgroup discovery". The contribution is two-fold. First, it adopts first-order predicates to represent the diverse background of a hyperlink, thus avoiding the technical challenge of extracting link context. Second, it induces first-order focused crawling rules with the "subgroup discovery" technique. The reason for adopting subgroup discovery rather than traditional ILP classification is as follows: unvisited hyperlinks accumulate rapidly in the crawl frontier, so as long as we can locate a fraction of the relevant hyperlinks, the crawler can keep itself busy, because a downloaded relevant web page often leads to many more relevant hyperlinks, thanks to the so-called "topic locality" phenomenon on the Web.
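The rule-guided prediction described above can be sketched as follows. This is a minimal, hypothetical illustration, not the dissertation's implementation: each hyperlink's background is flattened to a set of ground facts (first-order predicates rendered as strings), each learned subgroup rule is a conjunction of such facts, and a hyperlink is predicted relevant if any rule fires. The predicate and rule names are invented for illustration only.

```python
# Hypothetical sketch: subgroup rules as conjunctions of ground facts.
# A rule fires when all of its conditions hold among a link's facts.

def matches(rule, facts):
    """A rule (set of conditions) fires if it is a subset of the link's facts."""
    return rule <= facts

def predict_relevant(link_facts, rules):
    """A hyperlink is predicted relevant if ANY subgroup rule fires."""
    return any(matches(rule, link_facts) for rule in rules)

# Illustrative rules and hyperlink backgrounds (names are assumptions).
rules = [
    {"anchor_contains(link, 'machine')", "in_list_context(link)"},
    {"url_contains(link, '/papers/')"},
]

frontier = {
    "http://example.org/papers/42": {"url_contains(link, '/papers/')",
                                     "anchor_contains(link, 'pdf')"},
    "http://example.org/about":     {"anchor_contains(link, 'contact')"},
}

# Keep only frontier links that some learned rule predicts relevant.
to_crawl = [url for url, facts in frontier.items()
            if predict_relevant(facts, rules)]
print(to_crawl)  # only the /papers/ link satisfies a rule
```

The any-rule-fires semantics also mirrors why subgroup discovery suffices here: the crawler does not need to classify every hyperlink correctly, only to locate enough relevant ones to keep the frontier productive.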
Furthermore, several diverse hyperlinks often point to the same web page, and the page is classified as relevant if only one of them satisfies one of the subgroup rules. We used the DMOZ dataset as our testbed and conducted experiments on a wide range of DMOZ topics, concluding that our approach is feasible. As soon as sufficient labeled instances with their background information have accumulated, our approach can induce a good number of subgroup rules with large support and confidence. The subsequent crawling process is then guided by these learned rules; consequently, the number of downloaded irrelevant web pages tends to diminish rapidly, while the crawl is sustained by such rules. We conducted a series of comparative experiments against other established alternatives. The experimental results show that our approach has an edge over the others in terms of "harvest ratio".
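The "harvest ratio" used for comparison is, in the usual focused-crawling sense, the fraction of downloaded pages that turn out to be relevant; a sketch (the counts are made up for illustration):

```python
# Harvest ratio: relevant pages downloaded / total pages downloaded.
# Higher is better; a well-guided focused crawler wastes fewer downloads.

def harvest_ratio(relevant_downloaded, total_downloaded):
    if total_downloaded == 0:
        return 0.0  # no downloads yet, define the ratio as zero
    return relevant_downloaded / total_downloaded

# Illustrative numbers only: 380 relevant pages out of 1000 downloaded.
print(harvest_ratio(380, 1000))  # 0.38
```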
Keywords/Search Tags: Relational