Research On Web Crawler Technology In Search Engine

Posted on: 2010-06-02
Degree: Master
Type: Thesis
Country: China
Candidate: H Y Guo
Full Text: PDF
GTID: 2178330332988356
Subject: Computer system architecture
Abstract/Summary:
With the development of the Internet and the exponential growth of web information, search engines have become an indispensable tool for retrieving information. For most search engines, how to collect pages effectively and efficiently with limited system resources has become a hot research topic. This thesis explores a web crawler system and studies its core algorithms in depth.

The thesis first analyzes the principles and architecture of search engines, discusses the fetching strategies of web crawlers, and proposes an improved fetching strategy based on page depth and weighted back-link count. Second, several critical algorithms are designed, including multi-threaded web crawling, duplicate URL elimination, and web page scheduling.

In addition, considering the characteristics of Chinese search engines, a conversion of Chinese character encodings is provided so that pages are stored in a unified encoding. A DNS caching mechanism is applied to speed up collection. Finally, an incremental crawling mechanism is applied to reduce the time and resources spent on pages that have not changed within the fetching cycle.

The experimental results show that the performance of the web crawler system meets the search engine's requirements for mass data processing.
Keywords/Search Tags: Web Crawler, Search Engine, Information Retrieval
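
The abstract names a fetching strategy based on page depth and weighted back-link count, together with duplicate URL elimination, but does not give the formulas. The following is only a minimal sketch of how such a crawl frontier might look, not the thesis's actual algorithm. The scoring rule (weighted back-link count minus a depth penalty), the `Frontier` class, the `normalize` helper, and the weight parameters are all assumptions introduced for illustration.

```python
import heapq
from urllib.parse import urldefrag, urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Canonicalize a URL so trivially different spellings compare equal."""
    url, _fragment = urldefrag(url)          # drop "#fragment"
    parts = urlsplit(url)
    path = parts.path or "/"                 # treat an empty path as "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


class Frontier:
    """Hypothetical crawl frontier: shallow, frequently linked pages first.

    Assumed score = backlink_weight * weighted_backlinks - depth_weight * depth.
    With non-negative weights a URL's score never decreases, so the newest
    heap entry for a URL is always its best one; older entries are skipped
    once the URL has been handed out.
    """

    def __init__(self, depth_weight: float = 1.0, backlink_weight: float = 1.0):
        self.depth_weight = depth_weight
        self.backlink_weight = backlink_weight
        self.fetched = set()     # URLs already handed out (duplicate elimination)
        self.backlinks = {}      # url -> accumulated weighted back-link count
        self.depth = {}          # url -> smallest depth seen so far
        self.heap = []           # entries: (-score, insertion order, url)
        self.counter = 0

    def score(self, url: str) -> float:
        return (self.backlink_weight * self.backlinks[url]
                - self.depth_weight * self.depth[url])

    def add(self, url: str, depth: int, link_weight: float = 1.0) -> None:
        """Record a discovered link; link_weight is assumed non-negative."""
        url = normalize(url)
        if url in self.fetched:
            return
        self.backlinks[url] = self.backlinks.get(url, 0.0) + link_weight
        self.depth[url] = min(depth, self.depth.get(url, depth))
        heapq.heappush(self.heap, (-self.score(url), self.counter, url))
        self.counter += 1

    def pop(self):
        """Return the best unfetched URL, or None when the frontier is empty."""
        while self.heap:
            _neg_score, _order, url = heapq.heappop(self.heap)
            if url in self.fetched:
                continue         # leftover entry for an already fetched URL
            self.fetched.add(url)
            return url
        return None


frontier = Frontier()
frontier.add("http://Example.com/a#top", depth=1)
for _ in range(3):               # three pages link to /b
    frontier.add("http://example.com/b", depth=2)
first = frontier.pop()
second = frontier.pop()
```

In this sketch `/b` is returned before `/a`: its three back-links outweigh its greater depth under the assumed default weights. A production crawler would also bound memory, for example by replacing the in-memory set with a hash-based or on-disk structure.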