
Research And Implementation On Aggregation Query Optimization Under The Big Data Environment

Posted on: 2016-04-16
Degree: Master
Type: Thesis
Country: China
Candidate: J Li
Full Text: PDF
GTID: 2348330536967507
Subject: Software engineering
Abstract/Summary:
In recent years, "big data" has become a hot topic. Big data analysis techniques are widely used in computer science, information studies, information systems, statistics and many other areas. Aggregation query is one of the most important of these techniques. Generally, an aggregation query (sometimes called data aggregation) summarizes a set of records into useful information through an aggregate function (such as SUM, AVG or COUNT). In the big data setting, data volumes are huge and users demand fast query responses, which makes aggregation query techniques challenging for researchers. This dissertation studies aggregation query optimization and reports the following results:

(1) In a big data environment, traversing a whole dataset to answer an aggregation query consumes a lot of time, while users have strict real-time requirements for such queries. Approximate aggregation queries based on sampling have therefore received widespread attention. Obtaining the query result quickly while still guaranteeing its accuracy, however, remains a challenge. In this dissertation, we propose an incremental sample expansion and error estimation method called IBML. IBML uses the bootstrap technique to assess accuracy. When the estimated error does not meet the user-defined error bound, IBML uses the Hoeffding inequality to perform interactive sample expansion until the estimated error falls within that bound. In addition, we deploy IBML on the Spark platform and implement the IBML interfaces on Spark. Experiments show that IBML answers approximate aggregation queries about 2x faster than EARL.

(2) A real-time aggregation query must constantly merge historical data with newly arriving data, which consumes a lot of time and seriously reduces query efficiency. Fast and lightweight aggregation has therefore become an important focus in the study of real-time aggregation queries. In this dissertation, we propose a lightweight parallel index called IndexStream, which builds a balanced binary tree index over distributed datasets. IndexStream greatly improves query speed, and the index structure itself adds minimal storage overhead. In addition, we implement IndexStream on Spark Streaming and deploy NRT, our real-time data analysis platform for online aggregation queries. Experiments show that IndexStream reduces aggregation query time from seconds to nanoseconds.

(3) In a cluster, stragglers seriously slow down aggregation queries and reduce their efficiency. Mitigating stragglers in an aggregation query job has therefore become increasingly important in aggregation query optimization. In this dissertation, we propose Hummer, a risk prediction model for straggler mitigation in distributed environments. Hummer collects historical information on each node in the cluster and uses it to build a straggler risk prediction model. When an aggregation job is submitted, Hummer uses this model to clone part of the job's tasks and so mitigate stragglers in the cluster. Experiments show that Hummer achieves a speedup of 46% over LATE and 18% over Dolly.
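
The incremental sample expansion described in (1) can be illustrated with a small single-machine sketch. This is not the IBML implementation: the abstract does not say how the Hoeffding bound drives expansion, so the sketch assumes it is used to cap the sample size, and that the sample grows geometrically until the bootstrap error estimate meets the bound. All function names, the doubling factor and the default parameters are hypothetical.

```python
import math
import random
import statistics


def hoeffding_sample_size(value_range, error_bound, delta=0.05):
    """Smallest n for which Hoeffding's inequality guarantees
    P(|sample mean - true mean| >= error_bound) <= delta,
    for values confined to an interval of width value_range."""
    return math.ceil(value_range ** 2 * math.log(2 / delta) / (2 * error_bound ** 2))


def bootstrap_error(sample, n_resamples=200, rng=random):
    """Bootstrap estimate of the standard error of the sample mean."""
    means = [
        statistics.fmean(rng.choices(sample, k=len(sample)))
        for _ in range(n_resamples)
    ]
    return statistics.pstdev(means)


def approximate_avg(data, error_bound, value_range,
                    initial_size=100, growth=2, seed=0):
    """Approximate AVG(data) by growing a sample until the bootstrap
    error estimate is within error_bound.

    Assumption (not from the source): the Hoeffding sample size acts as
    an upper cap, and the sample is expanded geometrically below it.
    """
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)  # a prefix of a shuffled list is a uniform sample
    cap = min(len(shuffled), hoeffding_sample_size(value_range, error_bound))
    size = min(initial_size, cap)
    while True:
        sample = shuffled[:size]  # each step extends the previous sample
        err = bootstrap_error(sample, rng=rng)
        if err <= error_bound or size >= cap:
            return statistics.fmean(sample), err, size
        size = min(size * growth, cap)


if __name__ == "__main__":
    values = [i % 100 for i in range(10_000)]  # true mean is 49.5
    estimate, err, used = approximate_avg(values, error_bound=1.0, value_range=99)
    print(f"estimate={estimate:.2f} bootstrap_error={err:.2f} rows_used={used}")
```

Because each step takes a longer prefix of the same shuffled list, the rows already sampled are reused rather than redrawn, which is the sense in which the expansion is incremental here.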
Keywords/Search Tags:error-bounded, offline aggregation query, lightweight parallel index, straggler, risk prediction model