
Design Of A Focused Crawler-an Algorithmic Perspective

Posted on: 2007-03-09
Degree: Master
Type: Thesis
Country: China
Candidate: S L Tan
GTID: 2178360185451623
Subject: Computer software and theory

Abstract:
A crawler is a computer program that automatically retrieves and stores pages from the Web. It starts with a list of URLs called seeds and traverses the Web by retrieving a page and then recursively retrieving all of its linked pages. As a special class of crawler, a focused crawler is designed to retrieve as many documents related to a given topic of interest as possible while consuming as few network and computational resources as possible. Over the past several years, the focused crawler has come to be regarded as one of the most important tools for building domain-specific search engines and digital libraries.

This paper first presents an overview of the focused-crawling domain. It then discusses several key issues in focused-crawling research: how to design a Web analysis algorithm that predicts the relevance and importance of a page before it is downloaded, how to choose a search strategy, how to obtain high-quality seeds, and how to represent the topic. Based on these discussions, the paper introduces a focused crawler that improves its analysis algorithm, its seed quality, and its topic representation from previous crawls. In our experiments, the crawler is evaluated in terms of harvest rate, and its results are better than those of a breadth-first crawler, a best-first crawler based on content similarity, and a best-first crawler based on the PageRank metric.
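The best-first, content-similarity baseline mentioned above can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: the `fetch` function (standing in for an HTTP download plus link extraction), the bag-of-words cosine similarity, and the relevance `threshold` are all assumptions made here for concreteness. The sketch also computes the harvest rate, i.e. the fraction of downloaded pages judged relevant.

```python
import heapq
import itertools
import re
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words term-frequency vectors."""
    a = Counter(re.findall(r"\w+", text_a.lower()))
    b = Counter(re.findall(r"\w+", text_b.lower()))
    dot = sum(a[t] * b[t] for t in a)
    norm = (sum(v * v for v in a.values()) ** 0.5) * \
           (sum(v * v for v in b.values()) ** 0.5)
    return dot / norm if norm else 0.0

def best_first_crawl(seeds, fetch, topic, limit=100, threshold=0.2):
    """Best-first focused crawl: always expand the highest-priority URL.

    `fetch(url)` must return (page_text, outlinks); here it is an
    assumed stand-in for downloading a page and extracting its links.
    Returns the list of relevant URLs and the harvest rate.
    """
    tie = itertools.count()  # tie-breaker so the heap never compares URLs
    frontier = [(-1.0, next(tie), s) for s in seeds]  # seeds get top priority
    heapq.heapify(frontier)
    seen, relevant, visited = set(seeds), [], 0
    while frontier and visited < limit:
        _, _, url = heapq.heappop(frontier)
        text, outlinks = fetch(url)
        visited += 1
        score = cosine_similarity(text, topic)
        if score >= threshold:
            relevant.append(url)
        for link in outlinks:
            if link not in seen:
                seen.add(link)
                # The parent page's relevance score is used as the
                # priority of its outlinks (negated: heapq is a min-heap).
                heapq.heappush(frontier, (-score, next(tie), link))
    harvest_rate = len(relevant) / visited if visited else 0.0
    return relevant, harvest_rate
```

A toy in-memory "web" makes the behavior visible: the off-topic page `b` contributes nothing to the frontier, while relevant pages steer the crawl toward their neighbors.

```python
web = {
    "s": ("web crawler search engine topic", ["a", "b"]),
    "a": ("focused crawler retrieves topic pages", ["c"]),
    "b": ("cooking recipes", []),
    "c": ("search engine indexing crawler", []),
}
relevant, rate = best_first_crawl(
    ["s"], lambda u: web[u], "focused crawler search engine"
)
```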
Keywords/Search Tags:Focused crawling, Web analysis, Hyperlink analysis