Font Size: a A A

Similarity Search In Peer-to-Peer Systems

Posted on:2006-04-25Degree:DoctorType:Dissertation
Country:ChinaCandidate:L H XuFull Text:PDF
GTID:1118360212484466Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Peer-to-peer computing (P2P) has become an extremely popular topic in computer science. In a P2P system, each peer is fully autonomic, has equal responsibility and plays the role of both a service costumer and a service provider. Moreover, peers can enter or leave the P2P network at any time. Thus, a P2P system is a type of fully dynamic distributed system without any central administration. The P2P computing paradigm has many advantages, such as, scalability, robustness, resource availability etc, and is especially suitable for distributed applications with properties of geographical distribution, heterogeneous resource, scalability and local autonomy. Hence, the P2P computing paradigm is driving the evolution from the host-centric Internet to the data-centric future Internet, and is thought of as a promising technology to reconstruct the future Internet-based applications.Despite of current achievements on P2P-based query processing, there are still a lot of problems need to be studied. To this end, this thesis is devoted to the issue of similarity search in P2P systems. It studies routing index, space partitioning, collaborative caching and probabilistic model-based similarity search techniques, in order to support semantic or similarity-based query processing methods above existing P2P systems. Major contributions of this thesis include:1. Similarity search in multi-dimensional data space is introduced to unstructured P2P systems. A simple yet effective routing index structure, called EVA-Index, is designed by combining both vector approximation and routing indexing techniques together. With the aid of EVA-Index, each peer can process similarity query with its local dataset, and route queries to promising peers with the desired data objects. Furthermore, each peer can reconfigure its neighboring peers to keep the relevant peers close by so as to improve system performance. Simulation shows that the proposed approach is efficient and effective to similarity query processing in unstructured P2P environments.2. An efficient space partitioning strategy is introduced to a structured P2P system. The whole data space is first partitioned based on a set of pre-generated reference points. Each reference point has an associated subspace and any pair of subspaces does not overlap with each other. As such, through building one-to-one mapping between nodes and reference points (and hence subspace), similarity search in high-dimensional space can be implemented by using the traditional high-dimensional indexing technique and distributed hash table-based resource location mechanism. Moreover, the routing cost of similarity search can be greatly reduced through capturing physical proximity of sub-spaces, and the load balance at nodes can be obtained by tuning the granularity of subspaces. Simulation shows the efficiency of the proposed method can be kept in term of dimensionality and network size increasing.3. The technique for collaborative cache, based on negotiation, is studied for supporting SQL query processing, and a cost model for evaluation of the network transmission cost of query plans is given. The cost of the query plans are estimated by using the cost of sub-query plans. The cost model is combined with collaborative overlap network, through negotiation between requesters and coordinators, to determine the logical expression and physical maintenance nodes of data cache. Thus, based on collaborative caching technique, the P2P system can implement semantic-based query processing. Simulation and real experiments show that the proposed method usually obtains near-optimized cache plans and reduces the cost of query processing, especially in the case that a single node can contribute limited storage space.4. A similarity search technique based on probabilistic information is investigated for the P2P file sharing application with hierarchical topics. This approach uses probabilistic model to describe the overlap between topics and the coverage of nodes to topics, and hence builds routing indices at nodes. Based on existing probabilistic information at nodes, the similarity search algorithm can deduce the coverage of neighboring nodes to the queried topic, so that a promising routing path can be determined. Further, using feedback of the previous queries, nodes that were responsible for routing topic queries can update their probabilistic information to guide future ones more efficiently. Simulation shows the proposed approach can gradually improve the search efficiency and effectiveness in a feedback-based way.In summary, this thesis gives a detailed description of the algorithm, implemen-tation and experimental evaluation of the above four similarity query processing approaches. These approaches are a kind of useful complement to and improvement on current query processing methods in the P2P environment. The work is based on complete survey and theoretical analysis along with extensive experimental evaluation. The experiments and analysis show that, compared with existing query processing methods for P2P systems, the approaches mentioned above have advantages in efficiency of both query processing and resource utilization.
Keywords/Search Tags:peer-to-peer computing, similarity search, index, cache, probability
PDF Full Text Request
Related items