Font Size: a A A

Query Processing In Structured Peer-to-Peer Networks

Posted on:2008-03-15Degree:DoctorType:Dissertation
Country:ChinaCandidate:R ZhangFull Text:PDF
GTID:1118360272959784Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Peer-to-Peer Computing(P2P) is a distributed system that is used for distributed resource management.Its goal is to aggregate enormous storage and process resources while minimizing resource visiting and searching cost.However,it has brought new challenges in designing such distributed data structures that can promise desired functionalities,such as data availability,dynamic system maintenance, independence,and support for complex query processing using unauthentic and unreliable components.Despite of current achievements in P2P-based query processing,there are still a lot of problems requiring to be studied.My research work is devoted to introducing the mature tree index structures of centralized systems into P2P environment,so as to solve the problems of query processing in structured P2P systems,including single dimensional and multi-dimensional range query,multi-attribute range query and high-dimensional information retrieval in P2P systems.The major contributions of this thesis include:1.A high performance index structure BATON~* is proposed for P2P systems, which is inspired by BATON.It is based on the m-way tree,not the binary tree,to support single dimensional range query processing.BATON~* can help to reach better query processing efficiency,better load balance performance and stronger fault recovery ability.It redefines the routing tables for each tree node and designs new search algorithm,which help to reduce the query processing cost to O(log_m N),where N is the network size and m is tree fanout. Additionally,the new search algorithm promises to avoid the root bottleneck problem.However there is a trade-off between the search cost and the updating cost,that is,the lower the search cost,the higher the updating cost. Considering this problem,a simple fanout-based cost model is introduced.It can help to estimate the approximate optimal value for m using the percentage between search operations and updating operations.Experiments are run in Planetlab.And the results verify that the new index structure has im- proved the search efficiency and the fault tolerance ability,and show that the fanout-based cost model is correct.2.A group-based multi-attribute query processing strategy is proposed.The character of multi-attribute query is that each query dose not have fixed attributes(dimensions) or fixed number of attributes.Using multi-dimensional index structures to deal with such queries,it will meet with much redundant messages.Using the single dimensional index structures to deal with such queries,the storage and maintenance cost is high.However each of them has its own advantages:the former one has lower storage and maintenance cost and the latter one has better search efficiency.The current design is going to group the attributes of the data space,use the space filling curve to reduce the dimensionality to single dimensional space and then index the transformed data into state-of-art single dimensional index structures,among which BATON~* is preferred.In such a way,on one hand,it takes full consideration of both the maintenance cost and the query processing cost,and on the other hand, it promises high query processing efficiency by employing BATON~* structure. Especially with knowing query probability to each attribute(dimensionality) in advance,its advantage will be more obvious.Simulation shows that the combination of group-based method with a high performance index structure will achieve good search efficiency with low storage and maintenance cost.3.A general framework,a virtual binary balanced tree,which supports multidimensional range query processing,is provided.It introduces the mature multi-dimensional index tree structures of centralized systems,which are based on hierarchical space division,to P2P systems successfully.The mapping method between the peers and the tree nodes,the way to distribute and store the data on the nodes,and the neighbor selection rules for nodes are all designed. Relying on this data structure,it puts forward the algorithm for multi-dimensional range query processing,which can solve range queries and KNN queries efficiently and bound the efficiency by O(log N) without the root bottleneck problem,where N is the network size.And uniform interfaces for data operation and network maintenance are defined,so any kind of spacedivision-based hierarchical tree structure can be easily implemented on the framework.The differences for these implementations are the way to split the space and the method to select the space to split.Only the leaf nodes manage data and the internal nodes are virtual nodes which just manage the indexing space.But each internal node has vertical(parent link and adjacent link) and horizontal(neighbor link) links to other internal nodes.These links are used to for failure recovery.AVL-Tree rotation method is responsible for the network restructure as long as the tree is detected unbalanced after resolving the load balancing problem.Experiments are run in Planetlab.The results verify that the design of the structure is reasonable and the query processing is efficient.4.Based on VBI-Tree,a new improved structure SDI is introduced.The main target of the new design is to reduce the maintenance cost.In SDI,it removes the data structure which records the ancestor index information for each internal node on the way to root,but uses one simple and specific ancestor link. The ancestor links are distributed among the nodes evenly.According to the change to tree structure,new search algorithms for both point queries and range queries are proposed,which can still promise the query efficiency but with less messages.That is to say,it can find the same number of results for queries but with less messages.Simulation shows that the new design gets better performance with larger dimensionality,bigger network size or more skewed data distribution.So the new design has better scalability.5.A hierarchical index structure,based on tree and Chord,for content-based search in P2P environmental is proposed.The main advantage is that it does not require any global knowledge maintainer.Each file and query can be represented as vectors of a list of terms.In this index structure,Chord is used to index terms and the hierarchical tree structure is used to classify and aggregate the vectors coming from the children.Each node on the tree will have k children(m≤k≤2m,m is the fanout).At each level,it arranges every g(m≤g≤2·m) adjacent nodes in the same group,and selects one leaf node as the parent.In each group,the parent calculates the weights for terms in the aggregated vector by TF.IDF.The larger the weight,the more important the terms to the group.Then some percentage of important terms are indexed directly to Chord by the parent,and the others are aggregated and sent to higher level parent node for further processing.As a result,it has no global knowledge maintainer in the system and the lower level nodes index most of the important terms out.The proposed query processing algorithm promises that query can be started at any node on the tree.Generally,the terms in the queries are important or popular ones,and they can be resolved by lower level nodes,so nodes at higher levels will not meet with obviously higher load.Some performance improvement methods are introduced in the end,which will improve the query processing quality and reduce the cost for both query processing and maintenance.Simulation shows that this index structure has good recall and precision without bottleneck problem.In summary,this thesis introduces four kinds of P2P index structures in details, which are all based on tree structures.Relying on the single dimensional index structure proposed by this work,one data distribution method is designed for supporting multi-attribute range query.All the work aims to solve the complex query processing in advanced application.For each index structure,it defines the search algorithms and explains the implementations as well as the load balancing strategy. The analysis or conclusion to each part of the work is verified by extensive experiments. These approaches are a kind of useful complement to and improvement on current query processing methods in P2P environment.
Keywords/Search Tags:peer-to-peer computing, framework, range query processing, multi-attribute, content-based search, index structure
PDF Full Text Request
Related items