Learning to crawl: Classifier-guided topical crawlers

Posted on: 2005-04-01
Degree: Ph.D.
Type: Dissertation
University: The University of Iowa
Candidate: Pant, Gautam
Full Text: PDF
GTID: 1454390008990377
Subject: Business Administration
Abstract/Summary:
Topical or focused crawlers follow the hyperlinked structure of the Web, guided by the scent of information, to identify and harvest topically relevant pages. To sniff out the appropriate scent, they mine the content of pages that have already been fetched in order to prioritize the fetching of unvisited pages. Topical crawling is currently a young and creative area of research that holds the promise of benefiting from several sophisticated data mining techniques. The use of classification algorithms to guide topical crawlers has been suggested sporadically in the literature. No systematic study, however, has been done on their relative merits. Using the lessons learned from our previous crawler evaluation studies, we experiment with multiple versions of different classification schemes. We also explore the effects of various techniques for deriving contexts of hyperlinks on crawling performance. The crawling process is modeled as a parallel best-first search over a graph defined by the Web. The classifiers provide heuristics to the crawler, thus biasing it towards certain portions of the graph (i.e., the Web). We have designed and developed a crawling framework that allows for flexible addition of new classifiers. The crawlers themselves are implemented as multi-threaded objects that run concurrently. Our results show that Naive Bayes is a weak choice for guiding a topical crawler. We also find that a crawler that exploits words both in the immediate vicinity of a hyperlink and in the entire parent page performs better than a crawler that depends on just one of those cues. In addition, a crawler that uses the tag tree hierarchy within Web pages provides effective coverage. We support these results with multiple crawls over more than one hundred topics covering millions of pages. Our post-hoc analysis provides insights into the results and the behavior of classifier-guided topical crawlers.
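The best-first search described in the abstract can be illustrated with a small sketch. This is not the dissertation's framework: it is a sequential (not parallel or multi-threaded) toy that runs over an in-memory graph rather than the Web, and the names `keyword_score`, `best_first_crawl`, `context_weight`, and `toy_web` are hypothetical. A simple keyword-fraction scorer stands in for a trained classifier. Each unvisited link is prioritized by a weighted mix of the score of its link context and the score of the whole parent page, mirroring the two cues the abstract mentions.

```python
import heapq
from typing import Callable, Dict, List, Set, Tuple

# Toy in-memory "Web" (an assumption for illustration):
# url -> (page text, list of (link context text, target url)).
ToyWeb = Dict[str, Tuple[str, List[Tuple[str, str]]]]


def keyword_score(text: str, topic_words: Set[str]) -> float:
    """Stand-in for a trained classifier: fraction of on-topic words."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in topic_words for w in words) / len(words)


def best_first_crawl(
    web: ToyWeb,
    seeds: List[str],
    score: Callable[[str], float],
    max_pages: int,
    context_weight: float = 0.5,
) -> List[str]:
    """Visit up to max_pages pages, always expanding the best-scored link."""
    frontier: List[Tuple[float, int, str]] = []  # min-heap on negated priority
    counter = 0  # tie-breaker so equal priorities pop in insertion order
    for url in seeds:
        heapq.heappush(frontier, (-1.0, counter, url))
        counter += 1
    seen = set(seeds)
    visited: List[str] = []

    while frontier and len(visited) < max_pages:
        _, _, url = heapq.heappop(frontier)
        if url not in web:  # dangling link in the toy graph
            continue
        visited.append(url)
        page_text, links = web[url]
        page_score = score(page_text)
        for context, target in links:
            if target in seen:
                continue
            seen.add(target)
            # Combine the link-context cue with the parent-page cue.
            priority = (context_weight * score(context)
                        + (1.0 - context_weight) * page_score)
            heapq.heappush(frontier, (-priority, counter, target))
            counter += 1
    return visited


toy_web: ToyWeb = {
    "seed": ("crawler search web",
             [("topical crawler", "a"), ("cooking recipes", "b")]),
    "a": ("topical crawler classifier", [("crawler classifier", "c")]),
    "b": ("cooking pasta", [("pasta sauce", "d")]),
    "c": ("classifier", []),
    "d": ("sauce", []),
}
topic = {"crawler", "classifier", "topical"}

if __name__ == "__main__":
    order = best_first_crawl(toy_web, ["seed"],
                             lambda t: keyword_score(t, topic), max_pages=3)
    print(order)
```

Setting `context_weight` to 1.0 or 0.0 makes the sketch depend on only one of the two cues, which is one way to mimic the single-cue crawlers the abstract compares against.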
Keywords/Search Tags: Crawler, Topical, Pages, Web