Research On Data Cube Technology Based On MapReduce

Posted on: 2014-04-20; Degree: Master; Type: Thesis
Country: China; Candidate: L Chen; Full Text: PDF
GTID: 2268330425484452; Subject: Computer technology
Abstract/Summary:
With the rapid development of the Internet and the information technology industry, the amount of data generated on networks keeps growing. Large data sets contain more useful information, but they also bring more challenges. On-Line Analytical Processing (OLAP), an important technology for data storage and analysis, faces the challenge of a huge amount of computation. Because the Data Cube is the primary means of OLAP, handling massive Data Cube data efficiently is a key issue in both OLAP research and applications. Google’s MapReduce is a simplified distributed programming model for processing large-scale data. Based on this distributed parallel model, this thesis presents parallel clustering, update, and query of the Data Cube. The main research achievements and innovations are as follows:

(1) Parallel clustering of the Data Cube: Based on the equivalence relation between the semantic features and the multiple dimensions of the Data Cube, a parallel semantic Cube hierarchical clustering algorithm built on the MapReduce framework is proposed. The Data Cube can be clustered rapidly, and only the upper and lower bounds of each equivalence class are ultimately saved, which realizes compressed storage of the Data Cube. This method saves storage space and also speeds up the clustering procedure. Because cluster information and hierarchical information are saved, it also enables rapid updates of the Data Cube and makes analysis of OLAP query behavior possible.

(2) Incremental maintenance of the Data Cube’s hierarchical clustering: Based on Data Cube equivalence classes and the hierarchical relations between them, an efficient batch update algorithm for the Data Cube in the MapReduce parallel framework is proposed. This addresses the low efficiency caused by maintaining large amounts of data.

(3) Parallel OLAP queries: Based on Data Cube equivalence classes, parallel optimizations for OLAP point queries and range queries are realized. In addition, a cache-based OLAP query optimization algorithm is proposed in the improved MapReduce model. By defining the various operations in an OLAP query, multiple OLAP queries are processed in parallel, which greatly improves query efficiency.

This thesis also analyzes the parallelization of the various semantic Cube operations in detail and designs their implementations under the MapReduce model. Comparisons between the parallel algorithms and traditional algorithms are made to demonstrate the advantages of the parallel algorithms.
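The idea in point (1) of storing only the upper and lower bounds of each equivalence class can be illustrated with a small sketch. The abstract does not give the thesis's algorithm, so what follows is only an assumption-laden illustration: it assumes "cover equivalence" (two cells are equivalent when they aggregate the same set of base tuples), uses SUM as the measure, and simulates the map, shuffle, and reduce steps in memory instead of running on Hadoop. All function names and the three-row data set are hypothetical.

```python
from collections import defaultdict
from itertools import product

ALL = "*"  # marks an aggregated ("all values") dimension

def generalizes(a, b):
    """True if cell a is equal to, or more general than, cell b."""
    return all(x == ALL or x == y for x, y in zip(a, b))

def map_phase(row_id, dims, measure):
    """Map: emit every cube cell that covers this base tuple (2^d cells)."""
    for mask in product([False, True], repeat=len(dims)):
        cell = tuple(v if keep else ALL for v, keep in zip(dims, mask))
        yield cell, (row_id, measure)

def shuffle(pairs):
    """Shuffle: group the emitted values by cell key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(cell, values):
    """Reduce: compute the cell's cover set and its SUM aggregate."""
    cover = frozenset(row_id for row_id, _ in values)
    return cell, cover, sum(m for _, m in values)

def build_classes(rows):
    """Group cells with identical cover sets; keep only bounds per class."""
    pairs = (kv for i, (dims, m) in enumerate(rows) for kv in map_phase(i, dims, m))
    by_cover = defaultdict(list)
    for cell, values in shuffle(pairs).items():
        cell, cover, total = reduce_phase(cell, values)
        by_cover[cover].append((cell, total))
    classes = []
    for members in by_cover.values():
        cells = [c for c, _ in members]
        # Upper bound: the most specific cell of the class.
        upper = min(cells, key=lambda c: c.count(ALL))
        # Lower bounds: cells with no more general cell in the same class.
        lower = sorted(c for c in cells
                       if not any(o != c and generalizes(o, c) for o in cells))
        classes.append({"upper": upper, "lower": lower,
                        "sum": members[0][1], "size": len(cells)})
    return classes

def point_query(classes, cell):
    """Answer a point query from the stored bounds alone (None if empty)."""
    for k in classes:
        if generalizes(cell, k["upper"]) and any(generalizes(l, cell) for l in k["lower"]):
            return k["sum"]
    return None

# Hypothetical base table: (store, product, season) -> sales
rows = [(("S1", "P1", "spring"), 6),
        (("S1", "P2", "spring"), 12),
        (("S2", "P1", "fall"), 9)]
classes = build_classes(rows)
```

In this toy example the 18 non-empty cells collapse into 6 classes, and a point query such as `point_query(classes, ("S1", "*", "*"))` is answered by locating the class whose bounds enclose the queried cell. A real MapReduce job would distribute the map and reduce steps across workers and would avoid materializing every cell; the sketch only shows why bounds are enough to recover each cell's aggregate.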
Keywords/Search Tags: OLAP, Hierarchical Cluster, OLAP Query Optimization, Hadoop, MapReduce