
The Design And Implementation Of Chinese Search Engine

Posted on: 2005-04-22
Degree: Master
Type: Thesis
Country: China
Candidate: J Wang
Full Text: PDF
GTID: 2168360152969128
Subject: Computer software and theory
Abstract/Summary:
A search engine is a primary tool for information retrieval, and the crawler, which downloads web pages, is one of its essential components. Designing an extensible, high-performance, large-scale search engine therefore comes down to designing an extensible, high-performance, large-scale crawler. Given the size and growth rate of the web, a parallel crawler system is designed. The system consists of multiple crawler processes, with exactly one crawler process running on each computer. Every crawler process has a local page repository and a local index repository: downloaded web pages are saved in the page repository, and the index built from them is saved in the index repository. To coordinate the crawler processes and avoid overlapping downloads, a URL server is designed. It runs on a single server, dispenses URLs among the crawler processes, and saves the URLs that the crawlers have found. To cope with the heavy load on the database, parallel access to multiple databases is implemented. Each crawler process is a small search engine, and together these small search engines form a large-scale search engine. A retrieval server is designed that submits the user's request to all crawler processes; every crawler searches its local index repository and sends its results to the retrieval server, which collects all results, sorts them, and outputs them. To reduce the heavy cost of updating the local repositories, an incremental crawler is studied, which refreshes the local repository by updating some of the old pages. A page-change model is built using artificial neural networks. Based on this model, the interval between changes of a page can be computed, and the crawler can revisit pages accordingly to refresh the local repository.
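The coordination described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not say how URLs are assigned to crawlers or how results are ranked, so the host-hash partitioning, the in-memory queues, the `(score, url)` result format, and all names (`UrlServer`, `report`, `next_url`, `merge_results`) are assumptions made for illustration only.

```python
import heapq
import zlib
from collections import deque
from urllib.parse import urlsplit


class UrlServer:
    """Central URL dispenser: deduplicates URLs and assigns each to one crawler."""

    def __init__(self, num_crawlers):
        self.num_crawlers = num_crawlers
        self.seen = set()  # every URL reported so far
        self.queues = [deque() for _ in range(num_crawlers)]

    def _owner(self, url):
        # Assumption: partition by a hash of the host name, so that all pages
        # of one site go to the same crawler and no two crawlers overlap.
        host = urlsplit(url).netloc.lower()
        return zlib.crc32(host.encode("utf-8")) % self.num_crawlers

    def report(self, urls):
        """Save URLs found by a crawler; return how many were new."""
        added = 0
        for url in urls:
            if url not in self.seen:
                self.seen.add(url)
                self.queues[self._owner(url)].append(url)
                added += 1
        return added

    def next_url(self, crawler_id):
        """Hand the next pending URL to a crawler, or None if none is pending."""
        queue = self.queues[crawler_id]
        return queue.popleft() if queue else None


def merge_results(per_crawler_results, k):
    """Retrieval-server step: merge per-crawler (score, url) lists, keep the top k."""
    all_hits = (hit for hits in per_crawler_results for hit in hits)
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[0])
```

In this sketch each crawler would call `next_url` with its own id, download the page, and pass the extracted links back through `report`. The retrieval server would call `merge_results` on the lists returned by the crawlers. A real deployment would need persistent storage and network communication, which are omitted here.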
Keywords/Search Tags: search engine, neural networks, crawler, Chinese word segmentation