Font Size: a A A

Large Scale Peer-to-Peer Content Retrieval

Posted on:2011-06-02Degree:DoctorType:Dissertation
Country:ChinaCandidate:H H ChenFull Text:PDF
GTID:1118360305992178Subject:Computer system architecture
Abstract/Summary:PDF Full Text Request
The Internet has become an extremely large-scale distributed information repository with the prevalence of network applications and the rapid development of networking techniques. The extremely large-scale information with unlimited expansion brings new challenges to the current Internet data management infrastructure and makes the topic of buding decentralized data management and sharing infrastructure the inevitable research trend. Large-scale distributed content retrieval has an important academic and practical value.Peer-to-Peer (P2P) network is one of the state-of-the-art distributed systems built over the Internet. It breaks the traditional client-server model by introducing the concept of peer which basically acts as both a client and a server in the network. The P2P model efficiently harnesses resources such as computation, storage, communication, and information residing at the edges of the Internet to form a distributed cooperative network in an independent and equal manner. P2P networks demonstrate many good features such as decentralization, good scalability, and the ability to tolerate faults, hence it indeed has a great potential to become a major infrastructure for sharing information on the Internet.However, P2P networks are distributed, dynamic, and heterogeneous in nature, making it extremely difficult to build a large-scale content retrieval system without centralized repository and index. First, by utilizing distributed hash tables (DHTs), existing P2P data sharing systems can only provide exact-match style search support (such as file-name based search) with poor complex content retrieval capacity. Second, without a cetralized index existing state-of-the-art retrieval models and algorithms extensively used in centralized search engines are not applicable to such a distributed environment. Overall, the key issue-how to build efficient global index that supports complex queries and enables novel distributed retrieval model has not been effectively solved up to now, making large scale P2P content retrieval a challenging and open research issue.To address this issue, we extend dimensions of current P2Ps including the concept, architecture, resource description and management, routing mechanisms, and result merging and ranking mechanisms. In this dissertation, we focus on the solutions to build scalable content retrieval in large-scale P2P networks, and propose a set of novel theories and enabling techniques.Specifically, the contributions of this dissertation are fourfold.1.Theory of optimizing Bloom Filter (BF) for distributed set operations and multi-keyword seach protocol over structured DHTs using optimal BF settings. Multi-keyword searching over DHTs needs distributed set operations over wide area network, raising unacceptable bandwidth cost. To address this issue, we propose a novel of bloom filter optimizing theory and design PWEB, an effective and efficient multi-keyword search protocol over DHTs based on the theory. We conduce comprehensive large scale trace-driven simulations to evaluate PWEB design using National Institute of Standards and Technology (NIST) benchmark TREC WT10G large scale standard testing collection and the query logs from a major commercial Web search engine. Results show that PWEB significantly outperforms existing techniques. It reduces the network traffic by 74%, as well as reducing the search latency by 41%.2.Multi-dimensional distributed hash table technique and global full-text indexing searching and ranking strategies. We extend the traditional DHT structure, and propose a novel multi-dimensional distributed hashing structure to support term-set indexing over distributed environments. A novel index pruning algorithm, TSS, is designed to reduce the index size. We base our design on the observations of the query logs from a major commercial search engines. We show through experiment that the TSS significantly reduces the space complexity of the global index from 0(2n) to a formal bound of O(mlog n), reduces the communication cost to 28% of the cost existing schemes, and achieves retrieval performance comparable to traditional popular centralized information retrieval algorithms. 3.Federated search over semantic overlay. We design and implement SemreX, a P2P based literature sharing and retrieval system. The system work demonstates the principle of intrest locality in P2P systems. Based on the principle, we propose a novel peer similarity measurement model. Peers with high similarity values are linked together to form the semantic overlay. We further investigage how to leverage the small world topology features to improve the search performance of the semantic P2P overlay. Performance evaluation results show that SermeX federated search protocol based on semantic overlay can greatly improve the overall search efficiency by 81.6% compared to the popular Gnutella protocol, greatly improving the scalability of the unstructured P2P retrieval systems.4. Difficulty-aware hybrid search protocol in P2P networks. By combining structured DHTs and unstructured search protocols, hybrid P2Ps have the potential to effectively improve the performance of P2P retrieval systems. The key issue of hybrid P2P network is how to estimate the number of peers with matched items. Existing studies assume that such popularity can be obtained by estimating the number of relevant items in the distributed networks; that is to say if there are many matched items, they are also widely distributed across the network. In this case, flooding based search protocol will be more efficient, otherwise DHT-based search protocol will be more efficient. Based on the demonstrated principle of interest locality, we argue that the assumption of existing work is invalid and not practical in real world P2P systems. We further propose QRank, a novel difficulty-aware hybrid search protocol. It intelligently selects search strategies based on the ranked the query difficulty estimated with the gathered statistical information of keyword popularity in distributed P2P networks. We collect the trace from popular real world P2P systems, and conduct comprehensive large scale trace-driven simulations to evaluate the performance of QRank design. Results shot that QRank hybrid P2P search protocol significantly improve the search performance for hybrid P2P networks. It improves the recall by 21,reduces the search latency by 26%, and reduces the network traffic by 40%, compared to existing work.
Keywords/Search Tags:Peer-to-Peer Network, Distributed Hash Table, Bloom Filter, Hybrid P2P, Retrieval
PDF Full Text Request
Related items