Online Aggregation Optimization For Big Data In Cloud

Posted on:2016-03-12

Degree:Doctor

Type:Dissertation

Country:China

Candidate:Y X Wang

Full Text:PDF

GTID:1108330503976560

Subject:Computer application technology

Abstract/Summary:

Big data is produced by a wide range of applications, including user information in social networks, sensor data, scientific data, and various kinds of log data. Big data query processing plays an important role in today's fast-paced, data-driven businesses. Extracting latent useful information from tremendous amounts of data has become an urgent demand, but traditional databases struggle to support it because of the massive data volumes and the complexity and diversity of the queries. An obvious question therefore emerges: what should we do, and how, to overcome the performance problems of big data query processing?
Online aggregation (OLA) in cloud computing is one effective answer to this question. Online aggregation aims to respond to large-scale aggregation queries early with a statistically valid estimate of the final result, trading accuracy for time. The basic idea behind OLA is to compute an approximate result over unbiased random samples and to refine that result as more samples are received. In this way, users can terminate a running query early if an acceptable estimate is reached quickly. Unfortunately, several limitations still affect overall performance: poor scalability across varied data distributions, a lack of support for multi-query optimization, and the failure problem of the online aggregation estimation method.
To overcome these limitations, we study the performance optimization of online aggregation from three aspects, namely data pre-processing, multi-query sharing, and a dynamic switching mechanism for online aggregation, while fully considering the data management, task scheduling, and execution model of the cloud environment.
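The basic idea of online aggregation described above can be illustrated with a minimal single-machine sketch. It is not the dissertation's method: it assumes the data fits in memory, uses a uniform shuffle as a stand-in for the random sampling step, and uses a normal-approximation confidence interval without a finite-population correction. The function names `online_average` and `run_until` and all parameters are hypothetical.

```python
import math
import random


def online_average(values, batch_size=1000, z=1.96, seed=0):
    """Yield (samples_seen, estimate, half_width) after each batch.

    Assumption: scanning a shuffled copy of `values` stands in for
    drawing unbiased random samples without replacement.
    """
    rng = random.Random(seed)
    order = list(values)
    rng.shuffle(order)

    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's algorithm)
    for i, x in enumerate(order, start=1):
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if i % batch_size == 0 or i == len(order):
            variance = m2 / (n - 1) if n > 1 else 0.0
            # Normal-approximation interval; no finite-population correction.
            half_width = z * math.sqrt(variance / n)
            yield n, mean, half_width


def run_until(values, target_half_width, **kwargs):
    """Consume estimates and stop early once the interval is tight enough."""
    result = None
    for result in online_average(values, **kwargs):
        if result[2] <= target_half_width:
            break
    return result


if __name__ == "__main__":
    data = [float(i % 100) for i in range(100_000)]
    seen, estimate, half_width = run_until(data, target_half_width=0.5)
    print(f"after {seen} samples: AVG is about {estimate:.2f} +/- {half_width:.2f}")
```

Each yielded triple is a refined estimate, so a caller can stop as soon as the interval is acceptable, which is the time-for-accuracy tradeoff the abstract describes.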
Following these research ideas, the main contributions of this dissertation can be summarized in four aspects. Firstly, we develop a content-aware partition method with an NRB-T block index to improve the sampling efficiency of online aggregation and make it much more scalable across varied data distributions. We also present a fair-allocation block placement strategy, suited to the content-aware partition method, that keeps the storage and computation load balanced efficiently. Secondly, we propose a two-level sharing strategy for online aggregation, tailored to the MapReduce framework, to improve the overall performance of running multiple queries in the cloud. The first level, sharing of sampling, is implemented by a customized sample management mechanism that significantly reduces redundant disk I/O cost. The second level, sharing of statistical computation, is implemented by a heuristic algorithm for generating sharing groups, called SLSA, which not only improves overall performance but also scales well to larger numbers of sharing queries. Thirdly, we propose a hybrid approximate query processing architecture that couples online aggregation with a bootstrap-based approximate query model. We derive a probabilistic model of the likelihood that online aggregation estimation fails, and build on it a scheme switching mechanism that dynamically switches unpromising online aggregation queries to the bootstrap scheme, which effectively eliminates the effect of online aggregation estimation failure. We also propose a progressive estimation method to further reduce the probability of false positives during dynamic scheme switching.
Finally, building on the theoretical work above, we design and implement a cloud-based online aggregation prototype system called OLACloud, which is deployed on the Southeast University Cloud Computing Platform (SEUCloud). All functional modules of the system are tested to verify the effectiveness of the theoretical results.
This dissertation explores online aggregation in the cloud in depth. Results from a series of simulations and experiments in a real cloud environment show that the strategies, algorithms, and mechanisms proposed in this dissertation can significantly improve the performance of online aggregation in the cloud. Both the theoretical and the system parts of this dissertation are of significant value for developing big data query processing technologies.
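For readers unfamiliar with the bootstrap scheme named in the third contribution, the following is a minimal sketch of a generic percentile bootstrap interval for an aggregate computed over a sample. It illustrates only the standard resampling technique, not the dissertation's probabilistic failure model or its switching mechanism, neither of which is specified in this abstract. The function name `bootstrap_interval` and its parameters are hypothetical.

```python
import random
import statistics


def bootstrap_interval(sample, aggregate=statistics.mean, resamples=500,
                       confidence=0.95, seed=0):
    """Return (point_estimate, lower, upper) using the percentile bootstrap.

    Assumption: `sample` is a list of numbers already drawn at random
    from the full data set.
    """
    rng = random.Random(seed)
    n = len(sample)
    # Recompute the aggregate on many resamples drawn with replacement.
    estimates = sorted(
        aggregate(rng.choices(sample, k=n)) for _ in range(resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lower = estimates[int(alpha * (resamples - 1))]
    upper = estimates[int((1.0 - alpha) * (resamples - 1))]
    return aggregate(sample), lower, upper


if __name__ == "__main__":
    sample = [float(i % 10) for i in range(200)]
    point, lower, upper = bootstrap_interval(sample)
    print(f"estimate {point:.2f}, interval [{lower:.2f}, {upper:.2f}]")
```

Because the interval comes from resampling rather than a closed-form variance formula, the same function works for aggregates such as `statistics.median`, where a normal-approximation interval is harder to derive.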

Keywords/Search Tags:

Cloud computing, Big data query processing, Online aggregation, Query optimization, Approximate estimation, Random sampling

Related items

1	Research And Implementation Of Sampling-Based Aggregation Query System On Big Data
2	Research And Implementation On Sampling Of Approximate Aggregation Query Under The Big Data Environment
3	Sampling-based randomization techniques for approximate query processing
4	Research On Approximate Query Processing Over Inconsistent Data
5	Approximate Query Processing Technology Based On Distribution Perception
6	Research On Sampling Based Aggregate Query Method Of Power Quality Data
7	Efficient Algorithms For Approximate Aggregation And Nearest Neighbor Queries Over Multi-Dimensional Data
8	Research On Distributed Query Processing And Optimization Of RDF Data
9	Research On Approximate Query Algorithm For Real-time Analysis Of Massive Data
10	Research On Online Aggregation Query Optimization Based On Spark