Research And Implementation On Aggregation Query Optimization Under The Big Data Environment

Posted on:2016-04-16

Degree:Master

Type:Thesis

Country:China

Candidate:J Li

GTID:2348330536967507

Subject:Software engineering

Abstract/Summary:

In recent years, "big data" has become a hot topic. Big data analysis techniques are widely used in computer science, information studies, information systems, statistics, and many other areas. Aggregation query is one of the most important data analysis techniques. An aggregation query (sometimes called data aggregation) summarizes data into useful information through an aggregate function such as SUM, AVG, or COUNT. In the big data setting, data volumes are huge and users demand fast queries, which makes aggregation query techniques challenging for researchers. This dissertation studies aggregation query optimization and reports the following results:

(1) In a big data environment, traversing a whole dataset to answer an aggregation query consumes a great deal of time, while users have strict real-time requirements for such queries. Approximate aggregation queries based on sampling have therefore received widespread attention. The challenge is to obtain results quickly while still guaranteeing their accuracy. This dissertation proposes an incremental sample expansion and error estimation method called IBML. IBML uses the bootstrap technique to assess accuracy. When the estimated error does not meet the user-defined error bound, IBML uses Hoeffding's inequality to expand the sample incrementally until the estimated error falls within that bound. IBML is deployed on the Spark platform, and its interfaces are implemented on Spark. Experiments show that IBML speeds up approximate aggregation queries by 2x compared with EARL.

(2) A real-time aggregation query must constantly merge historical data with newly arriving data. This merging consumes a great deal of time and seriously reduces query efficiency, so fast and lightweight aggregation has become an important focus in the study of real-time aggregation queries. This dissertation proposes a lightweight parallel index called IndexStream, which builds a balanced binary tree index over distributed datasets. IndexStream greatly improves query speed, and the index structure itself adds minimal storage overhead. IndexStream is implemented on Spark Streaming and deployed in NRT, a real-time data analysis platform for online aggregation queries. Experiments show that IndexStream reduces aggregation query time from seconds to nanoseconds.

(3) In a cluster, stragglers seriously reduce the speed and efficiency of aggregation queries, so mitigating stragglers in an aggregation query job is increasingly important for aggregation query optimization. This dissertation proposes Hummer, a risk prediction model for straggler mitigation in distributed environments. Hummer collects historical information on each node in the cluster and uses it to build a straggler risk prediction model. When an aggregation job is submitted, Hummer uses this model to clone part of the job's tasks and so mitigate stragglers effectively. Experiments show that Hummer speeds up jobs by 46% compared with LATE and by 18% compared with Dolly.
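The sampling loop described in result (1) can be illustrated with a minimal single-machine sketch. It is not the thesis's IBML implementation, whose details the abstract does not give. The function names, the doubling schedule, the percentile-based bootstrap error, and the use of Hoeffding's inequality as an upper cap on the sample size are all assumptions made for illustration, as is the requirement that the value range of the column is known in advance.

```python
import math
import random
import statistics


def hoeffding_sample_size(value_range, epsilon, delta):
    """Sample size n for which Hoeffding's inequality gives
    P(|sample mean - true mean| >= epsilon) <= delta,
    for values confined to an interval of width value_range."""
    return math.ceil(value_range ** 2 * math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def bootstrap_error(sample, rng, resamples=200, confidence=0.95):
    """Half-width of the percentile bootstrap interval for the sample mean."""
    n = len(sample)
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(resamples)
    )
    lo = means[int((1.0 - confidence) / 2.0 * resamples)]
    hi = means[int((1.0 + confidence) / 2.0 * resamples) - 1]
    return (hi - lo) / 2.0


def approximate_avg(data, value_range, epsilon, delta=0.05, initial_size=100, seed=0):
    """Hypothetical incremental AVG estimator: grow the sample until the
    bootstrap error is within epsilon or the Hoeffding cap is reached."""
    rng = random.Random(seed)
    # A random permutation makes every prefix a uniform sample without
    # replacement, so expanding the sample reuses the rows already drawn.
    # (A real system would sample lazily instead of shuffling all indices.)
    order = list(range(len(data)))
    rng.shuffle(order)
    cap = min(len(data), hoeffding_sample_size(value_range, epsilon, delta))
    size = min(initial_size, cap)
    while True:
        sample = [data[i] for i in order[:size]]
        estimate = statistics.fmean(sample)
        error = bootstrap_error(sample, rng)
        if error <= epsilon or size >= cap:
            return estimate, error, size
        size = min(cap, size * 2)  # assumed expansion schedule: doubling


if __name__ == "__main__":
    gen = random.Random(42)
    values = [gen.random() for _ in range(100_000)]
    est, err, used = approximate_avg(values, value_range=1.0, epsilon=0.02)
    print(f"estimate={est:.4f} bootstrap_error={err:.4f} rows_used={used}")
```

In this sketch the bootstrap decides when to stop, while the Hoeffding bound only limits how far the sample may grow; the thesis may combine the two differently.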

Keywords/Search Tags:

error-bounded, offline aggregation query, lightweight parallel index, straggler, risk prediction model

Related items

1	Research And Implementation On Sampling Of Approximate Aggregation Query Under The Big Data Environment
2	Study On The Parallel Query Based On The Index For A Native XML Database
3	The Research And Implement Of Parallel Query Over Massive Data On Multi Database
4	Research And Application Of Combinatorial Method For The Financial Risk Prediction
5	Queries with Bounded Errors & Bounded Response Times on Very Large Data
6	Research And Implementation Of Lightweight Parallel Computing Model Based On BSP
7	Research On The Key Techniques For XML Index And Query
8	Massive Data Aggregation And Parallel Implementation With Complex Constraints
9	Research On Algorithms About Temporal Aggregation Query
10	Research And Application Of Query Optimization Based On HBase