
Design Of A Focused Crawler-an Algorithmic Perspective

Posted on: 2007-03-09
Degree: Master
Type: Thesis
Country: China
Candidate: S L Tan
GTID: 2178360185451623
Subject: Computer software and theory

Abstract:
A crawler is a computer program that automatically retrieves and stores pages from the Web. It starts with a list of URLs called seeds and traverses the Web by retrieving a page and then recursively retrieving all of its linked pages. As a special class of crawler, a focused crawler is designed to retrieve as many documents related to a given topic of interest as possible while consuming as few network and computational resources as possible. Over the past several years, the focused crawler has come to be regarded as one of the most important tools for building domain-specific search engines and digital libraries.

This paper first presents an overview of the focused-crawling domain. It then discusses several key issues in focused-crawling research: how to design a Web analysis algorithm that predicts the relevance and importance of a page before it is downloaded, how to choose a search strategy, how to obtain high-quality seeds, and how to represent the topic. Based on these discussions, the paper introduces a focused crawler that improves its analysis algorithm, its seed quality, and its topic representation from previous crawls. In our experiments, the crawler is evaluated in terms of harvest rate, and its results are better than those of a breadth-first crawler, a best-first crawler based on content similarity, and a best-first crawler based on the PageRank metric.
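The best-first, content-similarity baseline mentioned above can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: the `fetch` function (standing in for an HTTP download plus link extraction), the bag-of-words cosine similarity, and the relevance `threshold` are all assumptions made here for concreteness. The sketch also computes the harvest rate, i.e. the fraction of downloaded pages judged relevant.

```python
import heapq
import itertools
import re
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words term-frequency vectors."""
    a = Counter(re.findall(r"\w+", text_a.lower()))
    b = Counter(re.findall(r"\w+", text_b.lower()))
    dot = sum(a[t] * b[t] for t in a)
    norm = (sum(v * v for v in a.values()) ** 0.5) * \
           (sum(v * v for v in b.values()) ** 0.5)
    return dot / norm if norm else 0.0

def best_first_crawl(seeds, fetch, topic, limit=100, threshold=0.2):
    """Best-first focused crawl: always expand the highest-priority URL.

    `fetch(url)` must return (page_text, outlinks); here it is an
    assumed stand-in for downloading a page and extracting its links.
    Returns the list of relevant URLs and the harvest rate.
    """
    tie = itertools.count()  # tie-breaker so the heap never compares URLs
    frontier = [(-1.0, next(tie), s) for s in seeds]  # seeds get top priority
    heapq.heapify(frontier)
    seen, relevant, visited = set(seeds), [], 0
    while frontier and visited < limit:
        _, _, url = heapq.heappop(frontier)
        text, outlinks = fetch(url)
        visited += 1
        score = cosine_similarity(text, topic)
        if score >= threshold:
            relevant.append(url)
        for link in outlinks:
            if link not in seen:
                seen.add(link)
                # The parent page's relevance score is used as the
                # priority of its outlinks (negated: heapq is a min-heap).
                heapq.heappush(frontier, (-score, next(tie), link))
    harvest_rate = len(relevant) / visited if visited else 0.0
    return relevant, harvest_rate
```

A toy in-memory "web" makes the behavior visible: the off-topic page `b` contributes nothing to the frontier, while relevant pages steer the crawl toward their neighbors.

```python
web = {
    "s": ("web crawler search engine topic", ["a", "b"]),
    "a": ("focused crawler retrieves topic pages", ["c"]),
    "b": ("cooking recipes", []),
    "c": ("search engine indexing crawler", []),
}
relevant, rate = best_first_crawl(
    ["s"], lambda u: web[u], "focused crawler search engine"
)
```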
Keywords/Search Tags:Focused crawling, Web analysis, Hyperlink analysis