The Theme (topical) Crawler And Its Applications - Theme Search Engine

Posted on: 2006-10-21
Degree: Master
Type: Thesis
Country: China
Candidate: H F Lv
GTID: 2208360152965980
Subject: Computer software and theory
Abstract/Summary:
The rapid growth of the World-Wide Web poses unprecedented scaling challenges for general-purpose crawlers and search engines, and more and more people want to obtain the information they need quickly and effectively. In this paper we propose a new hypertext resource discovery system, a topical (focused) crawler, built on Jeff Heaton's Bot package (spider). Rather than collecting and indexing all accessible Web documents, a focused crawler selectively seeks out pages that are relevant to a pre-defined set of topics. By avoiding irrelevant regions of the Web, it yields significant savings in hardware and network resources. We improve Jeff Heaton's spider in several areas, including web page parsing and the crawling algorithm, and add two new functions: a Chinese analyzer and a general I/O interface. We also put forward a new page refreshing strategy based on the existing ones and demonstrate its effectiveness theoretically. We use one application of the crawler, a search engine, to evaluate the performance of both Jeff Heaton's spider and the topical crawler. The topical search engine is composed of three parts: a crawler that fetches web pages, an indexer that indexes the downloaded pages, and a search component that retrieves the pages relevant to a user query. The latter two parts are implemented with Lucene, an open-source library.
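The idea of a focused crawl can be illustrated with a minimal sketch. It is not the thesis's algorithm, which the abstract does not specify: it assumes a toy relevance measure (the fraction of a page's words that are topic terms), a best-first frontier, and an in-memory dictionary standing in for the Web. The names `relevance`, `focused_crawl`, `threshold`, and `max_pages` are hypothetical.

```python
import heapq
import re


def relevance(text, topic_terms):
    """Toy relevance score: fraction of words in `text` that are topic terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in topic_terms)
    return hits / len(tokens)


def focused_crawl(pages, seeds, topic_terms, threshold=0.2, max_pages=100):
    """Best-first crawl over `pages`, a dict mapping url -> (text, outlinks).

    `pages` is an assumed stand-in for fetching and parsing real web pages.
    Links are followed only from pages scoring at least `threshold`, so
    regions reachable only through irrelevant pages are never expanded.
    """
    # heapq is a min-heap, so priorities are negated to pop the best first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    collected = []
    while frontier and len(collected) < max_pages:
        _, url = heapq.heappop(frontier)
        if url not in pages:
            continue  # unreachable page in this toy model
        text, outlinks = pages[url]
        score = relevance(text, topic_terms)
        if score < threshold:
            continue  # irrelevant page: do not keep it or follow its links
        collected.append((url, score))
        for link in outlinks:
            if link not in seen:
                seen.add(link)
                # A link inherits the score of the page it was found on.
                heapq.heappush(frontier, (-score, link))
    return collected
```

In this sketch the parent page's score is used as the priority of its outgoing links, which is one common heuristic; a real crawler would also need fetching, politeness rules, and a refresh policy, none of which are modelled here.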
Keywords/Search Tags: Crawler, Topical crawler, Search engine, Spider, Lucene