| A data cube is a multidimensional data model that effectively supports OLAP. It reduces query response time and improves application efficiency by precomputing and storing the results of GroupBys over all combinations of attributes. With the explosive growth of data volumes, integrating data cube materialization with distributed computing models (e.g., MapReduce), which are increasingly widely used, has become an inevitable trend. For algebraic measures, e.g., SUM, data cube materialization can be completed efficiently by simply using the MapReduce framework. For holistic measures, such as DISTINCT, however, integrating with MapReduce in the same way as for algebraic measures leads to load imbalance and large amounts of intermediate data. The state-of-the-art distributed algorithm MR-Cube tries to alleviate these two problems via data partitioning and batch area calculation. However, MR-Cube's data partitioning is not accurate and may still lead to load imbalance under extremely skewed distributions. For batch area calculation, MR-Cube only proposes some rules rather than a simple and specific batching method, and it computes GroupBys with BUC, which cannot make full use of the performance of the MapReduce framework. In this paper, we propose TeraSortPipeSort-Cube (TSP-Cube for short), which borrows ideas from TeraSort and PipeSort in order to thoroughly solve the problems of load imbalance and large amounts of intermediate data. Borrowing the idea of random sampling from TeraSort, TSP-Cube partitions the data according to the frequencies of the data in the sample, which not only reduces or even avoids unnecessary data partitioning but is also suitable for different types of distribution. Meanwhile, TSP-Cube applies PipeSort instead of BUC for batch area calculation, because PipeSort can take full advantage of the characteristics of the MapReduce framework.
In addition, for specific hierarchical data sets, TSP-Cube puts forward a pipeline generation method that generates batch areas according to the features of the data set's attributes and the characteristics of PipeSort, which solves the problem of large amounts of intermediate data. Finally, we demonstrate that TSP-Cube performs better and is more general in cube materialization with holistic measures than current state-of-the-art algorithms, under both uniform and extremely skewed distributions. The experiments also include a comparison on algebraic measures, from which we draw conclusions about the best algorithm in different situations. |
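The distinction between algebraic and holistic measures drawn above can be illustrated with a minimal sketch. It is not part of TSP-Cube or MR-Cube; the toy records, the round-robin `split` helper, and all function names are assumptions made for illustration only. It shows that per-partition partial SUMs can be merged by addition, while per-partition DISTINCT counts cannot, which is why a holistic measure needs all rows of a group to reach the same reducer.

```python
from collections import defaultdict

# Hypothetical toy records: (group key, user id, amount).
ROWS = [
    ("US", "u1", 10),
    ("US", "u2", 5),
    ("US", "u1", 7),
    ("DE", "u3", 4),
    ("US", "u2", 1),
    ("DE", "u3", 6),
]


def split(rows, n):
    """Round-robin split standing in for the input splits of a distributed job."""
    return [rows[i::n] for i in range(n)]


def merged_sum(partitions):
    """SUM is algebraic: per-partition partial sums can simply be added."""
    total = defaultdict(int)
    for part in partitions:
        partial = defaultdict(int)
        for key, _user, amount in part:
            partial[key] += amount
        for key, value in partial.items():
            total[key] += value
    return dict(total)


def naive_merged_distinct(partitions):
    """Adding per-partition distinct counts over-counts users that
    appear in more than one partition, so this merge is incorrect."""
    total = defaultdict(int)
    for part in partitions:
        seen = defaultdict(set)
        for key, user, _amount in part:
            seen[key].add(user)
        for key, users in seen.items():
            total[key] += len(users)
    return dict(total)


def exact_distinct(rows):
    """Exact distinct count per group, computed with all rows of a group together."""
    seen = defaultdict(set)
    for key, user, _amount in rows:
        seen[key].add(user)
    return {key: len(users) for key, users in seen.items()}


if __name__ == "__main__":
    parts = split(ROWS, 2)
    print("merged SUM:           ", merged_sum(parts))
    print("naive merged DISTINCT:", naive_merged_distinct(parts))
    print("exact DISTINCT:       ", exact_distinct(ROWS))
```

In this toy input the merged SUM matches the exact totals, whereas the naive DISTINCT merge reports three users for the `US` group although only two exist. Shipping every row of a group to one reducer avoids the error, but it is also what produces the intermediate data volume and the skew-induced load imbalance discussed above.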