Design And Implementation Of A Distributed Search Engine Caching System

Posted on: 2012-03-17
Degree: Master
Type: Thesis
Country: China
Candidate: P Zhou
Full Text: PDF
GTID: 2208330332992875
Subject: Computer application technology
Abstract/Summary:
According to the 27th Statistical Report on Internet Development in China, released by CNNIC in January 2011, the number of Internet users in China had reached 457 million by December 2010, and search engines had become the most popular Internet application. Over the same period the number of Chinese web pages reached 60 billion, an increase of 78.6% over the previous year. The rapid development of the Internet has brought new challenges for search engines. A large-scale Web search engine needs to handle tens of thousands of queries per second on average, and with the explosive growth of online information each query touches a massive amount of index data, so query processing has become the main bottleneck of search engines. To improve query response speed without reducing query quality, large-scale Web search engines usually employ caching.

Search engine caching systems can generally be divided into two categories. The first is the results cache: the results of some queries are cached, so that when the same query appears again the results can be taken directly from the cache, which greatly improves query response speed. The second is the inverted list cache. Because the inverted index handled by a search engine is usually huge, it cannot be loaded into memory completely, so at retrieval time inverted lists have to be read from disk. Disk I/O is much slower than memory access, so disk reads account for most of the search time. The inverted list cache reduces this cost: the inverted lists of high-frequency terms are kept in memory in order to improve the average retrieval performance of the search engine.

Building on the retrieval core and distributed framework of CAS-ICT I3Search, this thesis designs and implements a two-level distributed search engine caching system and proposes a new result caching algorithm.
The proposed result caching algorithm is intended to improve cache efficiency when the index is updated rapidly, without reducing query quality. The main contributions are:
1) An inverted list cache module is implemented in the I3Search retrieval core. The module caches the inverted lists with the highest benefit, judged by factors such as query frequency and inverted list length, in order to improve I3Search performance.
2) The main problem with a results cache is that it cannot cope with rapid index updates: whenever the index data changes, all cached data becomes invalid. To address this, the thesis designs and implements a new result caching algorithm that caches the documents of query results. Even when the index is updated quickly, the cached documents can still be used to improve response speed.
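The idea of caching the inverted lists with the highest benefit can be illustrated with a minimal sketch. It is not the I3Search implementation: the PostingList type, the select_lists_to_cache function, the benefit score (query frequency divided by list length) and the sample numbers are all assumptions made for illustration. The abstract only states that query frequency and list length are among the factors considered.

```python
from dataclasses import dataclass


@dataclass
class PostingList:
    """Hypothetical summary of one inverted list (not an I3Search type)."""
    term: str
    query_freq: int  # how often the term occurred in past queries (assumed known)
    length: int      # number of postings, used here as a proxy for memory cost


def select_lists_to_cache(lists, budget):
    """Greedily pick the lists with the highest benefit per posting.

    Assumed benefit score: query_freq / length. Lists are taken in
    descending score order as long as they still fit in the budget.
    """
    candidates = [p for p in lists if p.length > 0]  # avoid division by zero
    ranked = sorted(candidates, key=lambda p: p.query_freq / p.length, reverse=True)
    chosen, used = [], 0
    for p in ranked:
        if used + p.length <= budget:
            chosen.append(p.term)
            used += p.length
    return chosen, used


# Made-up statistics for four terms.
lists = [
    PostingList("search", query_freq=900, length=3000),
    PostingList("engine", query_freq=800, length=1000),
    PostingList("the", query_freq=1000, length=10000),
    PostingList("cache", query_freq=300, length=500),
]
cached_terms, used = select_lists_to_cache(lists, budget=5000)
print(cached_terms, used)
```

With these made-up numbers the very frequent but very long list for "the" is left out: its benefit per posting is the lowest of the four, and it would not fit in the budget anyway. A real system would also have to weigh the factors the abstract leaves unspecified ("and so on").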
Keywords/Search Tags: Distributed Search Engines, Query Cost, Result Caching, List Caching