Distributed Computation Of Data Cube

Posted on:2015-07-01

Degree:Master

Type:Thesis

Country:China

Candidate:Y Y Zhou

Full Text:PDF

GTID:2298330422977184

Subject:Software engineering

Abstract/Summary:

A data cube is a multidimensional data model that effectively supports OLAP. By precomputing and storing the results of the GroupBys over all combinations of attributes, it reduces query response time and improves application efficiency. With the explosion of massive data, integrating data cube materialization with distributed computing models (e.g. MapReduce), which are increasingly widely used, has become an inevitable trend.

For algebraic measures, e.g. SUM, data cube materialization can be completed efficiently by simply using the MapReduce framework. For holistic measures such as DISTINCT, however, integrating with MapReduce in the same way as for algebraic measures leads to load imbalance and large volumes of intermediate data. The state-of-the-art distributed algorithm MR-Cube tries to alleviate these two problems through data partitioning and batch area calculation. However, MR-Cube's data partitioning is not accurate and may still lead to load imbalance under extremely skewed distributions. For batch area calculation, MR-Cube only proposes some rules rather than a simple and specific batching method, and it computes GroupBys with BUC, which cannot make full use of the performance of the MapReduce framework.

In this thesis, we propose TeraSortPipeSort-Cube (TSP-Cube for short), which borrows ideas from TeraSort and PipeSort to thoroughly solve the problems of load imbalance and excessive intermediate data. Borrowing TeraSort's idea of random sampling, TSP-Cube partitions the data according to the frequencies of values in the sample, which not only reduces or even avoids unnecessary data partitioning but also suits different types of distribution. Meanwhile, TSP-Cube applies PipeSort instead of BUC for batch area calculation, because PipeSort can take full advantage of the characteristics of the MapReduce framework. In addition, for specific hierarchical data sets, TSP-Cube puts forward a pipeline generation method that generates batch areas according to the features of the data set's attributes and the characteristics of PipeSort, and thereby solves the problem of excessive intermediate data.

Finally, we demonstrate that, compared with current state-of-the-art algorithms, TSP-Cube performs better and is more general in cube materialization with holistic measures, under both uniform and extremely skewed distributions. The experiments also include a comparison on algebraic measures, from which we draw conclusions about the best algorithms for different situations.
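
The contrast between algebraic and holistic measures described in the abstract can be illustrated with a small sketch. It is not the TSP-Cube or MR-Cube algorithm: the toy rows, the two-way split into partitions, and the helper names (`cuboids`, `partial_cube`, `merge`) are all assumptions made for illustration. The sketch only shows why partial SUM results can be merged as single numbers, while a DISTINCT count has to carry every distinct value to the merge step, which is one source of the large intermediate data mentioned above.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy rows: (city, product, user, amount).
ROWS = [
    ("Beijing", "book", "u1", 10),
    ("Beijing", "book", "u2", 5),
    ("Beijing", "pen", "u1", 2),
    ("Shanghai", "book", "u1", 7),
]
DIMS = ("city", "product")  # dimensions of the cube (assumed for this example)


def cuboids(dims):
    """Yield every subset of dimension indexes, i.e. every GroupBy of the cube."""
    for k in range(len(dims) + 1):
        yield from combinations(range(len(dims)), k)


def partial_cube(rows):
    """Per-partition aggregates, roughly what a map-side task could emit."""
    sums = defaultdict(int)   # algebraic: one number per group
    users = defaultdict(set)  # holistic: every distinct value per group
    for row in rows:
        for cuboid in cuboids(DIMS):
            key = (cuboid, tuple(row[i] for i in cuboid))
            sums[key] += row[3]
            users[key].add(row[2])
    return sums, users


def merge(parts):
    """Combine per-partition results, roughly what a reduce-side task could do."""
    total, distinct = defaultdict(int), defaultdict(set)
    for sums, users in parts:
        for key, value in sums.items():
            total[key] += value       # partial SUMs simply add up
        for key, seen in users.items():
            distinct[key] |= seen     # DISTINCT needs the full value sets
    return total, {key: len(values) for key, values in distinct.items()}


# Two partitions of the toy data, processed independently and then merged.
parts = [partial_cube(ROWS[:2]), partial_cube(ROWS[2:])]
total, distinct = merge(parts)

apex = ((), ())  # the empty GroupBy, i.e. the grand total
naive = sum(len(users[apex]) for _, users in parts)  # adding partial counts
print(total[apex], distinct[apex], naive)
```

On this toy input the merged SUM for the empty GroupBy is 24 and the distinct user count is 2, whereas adding the per-partition distinct counts gives 3, because user `u1` appears in both partitions.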

Keywords/Search Tags:

Data Cube, Distribution, MapReduce, TeraSort

Related items

1	Research And Implementation Of Building Data Cube Based On Mapreduce
2	Multidimensional Data Model For Mining And Analysis Based On Multiple Structure Data Cube
3	Research And Implementation Of Distributed Cube Distributed Storage And Construction Algorithm
4	Research And Implementation Of Construction Algorithms For Closed Histogram Cube
5	Research And Implementation Of Histogram Cube Compressed Storage And Incremental Updating And Query Under Cloud Environment
6	Techniques Research For Data Cube Compression
7	Research And Implementation On Mapreduce-based Aggregation Algorithms
8	Research On Data Cube Technology Based On MapReduce
9	Research Of Distributed Data Cube Partial Materialization Method Based On Genetic Algorithm
10	Cube Attacks On PRINCE And PRESENT