Font Size: a A A

Research And Design On The Large-scale Full-text Searching System

Posted on:2005-09-11Degree:MasterType:Thesis
Country:ChinaCandidate:J YuFull Text:PDF
GTID:2168360152967705Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
With rapid popularization and development of Internet, a huge information ocean is in face of people. It is vital for us to find the information what we are really interested in from it quickly. Full-text searching system (mainly referring to search engine) is such a tool. However, most of the current full-text searching systems adopt centralized architecture, which results in poor performance in scalability and fault tolerance. Consequently, it is difficult for them to keep up with the rapid growth of data amounts. In order to solve above problems, we aim to build a distributed full-text searching system, called UbiSearch. To our best knowledge, there is no similar work as far as now. In this thesis, we focus on three aspects: system design, distributed ranks calculation and caching strategy. Our achievements are as follows.We complete the design of UbiSearch, including the designs of overall system, all modules and a simulation subsystem. Its fully distributed architecture brings the advantages of scalability, self–organizing and avoidance of single-point failure. Among all the modules, we focus on the design of indexer and searcher based on the Multi-Level Partitioning Strategy (MLP), a novel index partition strategy proposed by us. The simulation subsystem has good scalability and can support kinds of network topologies and peer-to-peer network protocols. The simulation results show that UbiSearch has low communication latency and load balance. More importantly, it efficiently reduces the bisection bandwidth consumption, solving the common problem in distributed searching area.A distributed PageRank algorithm is put forward based on PageRank algorithm, which makes distributed ranks calculation possible. Its convergence is proved in theory and the uniformity between its results and that of PageRank algorithm is further verified through the experiments. In addition, an indirect transmission mode is presented, which reduces the number of messages dramatically during calculation and achieves better scalability than the direct transmission mode.To optimize the performance of UbiSearch, we present a caching strategy called PP Cache. Compared with traditional caching strategy in which data is copied along the query path directly, it not only reduces bandwidth consumption by 1-3 orders of magnitude, but also holds a better performance in load balance. Besides, PP Cache also makes the system have a steady query performance even when there are few free spaces for storage. Furthermore, instant and complete deletion of caches is also supported in PP Cache, which eliminates the potential inconsistency caused by cache.
Keywords/Search Tags:UbiSearch, Distributed, Full-text Searching, Rank Calculation, Caching
PDF Full Text Request
Related items