Font Size: a A A

Data Placement And Query Optimization In Large-scale Network Storage Environment

Posted on:2012-04-01Degree:DoctorType:Dissertation
Country:ChinaCandidate:T ChenFull Text:PDF
GTID:1118330341451766Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
The efficient management of increasing massive data has been a huge challenge in the areas of scientific research, engineering and information service. The massive data enables huge requirements towards large-scale storage environment. Current network storage cannot satisfy the application requirements of distributed large-scale data management in terms of scalability, high-performance, concurrency, distributed management, security and availability, consistent and reliability. Thus, researching on large-scale network storage is very important. This dissertation focuses on the research of key technologies in large-scale network storage such as data placement, query optimization and metadata load balancing, and proposes several effective solutions and algorithms. The main researches of this dissertation are as follows:(1) This dissertation proposes a multi-replication-aware and adaptable data placement algorithm RSEDP.The reliability and adaptability of large-scale network storage systems are confronted with big challenges, which require designing a reliable, adaptable data placement algorithm. Previous techniques can only partially satisfy these requirements. In this work, we propose a reliable and replication-aware data placement algorithm (RRDP), and efficient and adaptable data placement algorithm (SEDP). Based on these two algorithms, we address a multi-replication-aware adaptable data placement algorithm RSEDP, which combines them to achieve the requirements mentioned above. RRDP distributes replicated data over large-scale heterogeneous network storage systems to avoid distributing the same replica to the consecutive devices. It can achieve high redundancy degree and failure resilience. SEDP combines clustering algorithm and consistent hash, saving the consumption of memory space by avoiding extra virtual devices. It assigns data evenly among devices according to their weight and adapts well to the expansions or curtailments of the systems. In order to take the advantages of both RRDP and SEDP, RSEDP integrates them by categorizing data into hot and cold data based on their access frequency, placing hot data by RRDP, and distributing the remainder by SEDP. The theoretical analysis and the experimental study show that RSEDP can increase redundancy degree and failure resilience, and has a good adaptability and time efficiency with small memory overhead.(2) This dissertation addresses an effective and hierarchy data placement algorithm EHDP.A majority of proposed data placement approaches are appropriate for single level mode, while multi-level approaches cannot locate data in constant time and have bad adaptability. This dissertation presents a novel hierarchy data placement algorithm. Specially, it first uses Max-Min algorithm to classify the devices into some classes, which can manage large-scale storage devices by dividing and ruling, and support flexible device configuration. Then we propose EFAH hash algorithm to assign data between and within classes. The theoretical analysis and experimental study shows that EHDP can locate data in constant time to reduce the computation overhead of metadata server and avoid the performance bottleneck. Moreover, it can distribute data evenly among devices to balance I/O load. In the event of devices changes, our approach can redistribute fewer objects to preserve even distribution so that the performance of systems would not be affected in the process of rebalancing I/O load.(3) This dissertation presents multi-top-k optimization algorithms over uncertain data streams.With the wide applications of large-scale network storage, data is formed as a data stream. Uncertain is inherent to application data streams due to the external factors. Top-k query processing over uncertain data streams has become increasingly important. Sharing the results of these queries is a key factor to save the computation cost and provide real time response. However, due to the complex semantics of uncertain top-k query processing, it is nontrivial to implement sharing among multiple top-k queries and few work has been proposed to address the sharing issue. In this dissertation, we define the frequency upper bound of each top-k query and formulate various types of sharing among multiple top-k queries. We present an optimal dynamic programming solution as well as a more efficient (in terms of time and space complexity) greedy algorithm. The elaborate theoretical analyses prove the upper bound of dynamic programming and no sharing method, and the lower bound of greedy algorithm and dynamic programming. Experiments have demonstrated that the greedy algorithm can find the optimal solution in most cases, and it can almost achieve the same performance (in terms of latency and throughput) as the dynamic programming approach. Comparing to no sharing method and inter-group sharing method, dynamic programming approach and greedy algorithm enables much smaller computation cost for executing queries, higher throughput and lower latency.(4) This dissertation addresses a multi-aggregate optimization algorithm over data streams.Many applications towards large-scale network storage register multiple aggregate queries to the system which have different sliding window sizes and different frequency upper bounds. How to share the results of these queries is a challenge. Some work first addressed this problem and used the earliest-deadline-first (EDF) method. However, it did not present a concrete optimization algorithm. In this dissertation, we formulate the sharing problem among multiple aggregate queries with different sliding window sizes and different frequency upper bounds over data streams and propose a combination rule to classify these queries. Then, we present an efficient sharing algorithm to compute the execution plan of queries which enables a query be executed more often than necessary as long as the interval between two consecutive executions is no more than the frequency upper bound. Thus, more queries can share the computation. Regarding the underloaded and overloaded situations, we also combine our sharing algorithm with EDF. An experimental study shows that our sharing algorithms are more efficient than no sharing and EDF in terms of the number of scanned tuples for executing queries, the throughput and the latency.(5) This dissertation proposes an adaptable and distributed metadata load balancing algorithm (ADMLB).Metadata load balancing is very important to improve the performance of I/O. The existing load balancing schemas cannot evenly distribute accesses of metadata in a dynamic way and have bad adaptability and fault-tolerance ability. This dissertation presents a basic load balancing algorithm (BBLA) and distributed incremental load balancing algorithm (IBLA). Based on BBLA and IBLA, we propose an adaptable and distributed metadata load balancing algorithm (ADMLB) which combines them. Specially, ADMLB first uses BBLA to distribute metadata loads according to the performances of the metadata servers, and then uses IBLA to incrementally reorganize loads on each metadata server. ADMLB can evenly distribute loads between metadata servers, and adjust the load distribution according to the changes of the loads. It has good fault-tolerance ability, and locates metadata servers very quickly as well.
Keywords/Search Tags:Large-scale Network Storage, Data Placement, Multiple top-k Queries Sharing, Multiple Aggregation Queries Sharing, Metadata Load Balancing
PDF Full Text Request
Related items