
Research And Implementation Of Online Multiple Aggregation Query System Over The Big Data

Posted on: 2016-12-17 | Degree: Master | Type: Thesis
Country: China | Candidate: Y G Dang | Full Text: PDF
GTID: 2428330542489570 | Subject: Computer technology
Abstract/Summary:
With the advent of the big data era, enterprise data has grown explosively, and business decisions increasingly depend on massive volumes of historical data. OLAP makes such data much easier to process. However, because of the large volume and high dimensionality of the data, OLAP still faces serious challenges in computing and storage, and processing in a distributed environment only alleviates them. The data cube emerged to improve query efficiency over large data sets, but it has a major drawback: its construction costs enormous time and space. The closed data cube, an effective lossless compression technique, addresses this problem. The data cube is also inflexible, since it often supports only one type of aggregate query; the histogram data cube enriches it considerably.

Building on existing compression methods, this thesis proposes a new compression method that optimizes the storage structure of the histogram data cube. The derived closed tuples are stored together with their basic tuple that has the smallest number of tuples, along with the measures and codes of the corresponding closed tuples, where one integer code represents one closed tuple. This effectively reduces storage space. The thesis applies an existing reversed count invert method to the measure vector in order to support approximate queries and thereby reduce the cost of the measure vector. It also improves the MRC-Cubing algorithm so that All tuples and basic tuples are computed more simply and efficiently, and it proposes a computation method for large closed tuples that balances the load across tasks.

Building a closed histogram data cube is time-consuming, so new data should be integrated into an existing cube quickly. The thesis analyzes the benefit and cost of incrementally updating the data cube and proposes two distributed incremental update methods: one merges the new data directly with the existing cube, and the other merges the two cubes. Both take much less time than recomputing the cube, and users can choose between them according to their needs.

To speed up queries on the closed histogram data cube, the thesis presents a query method based on an inverted index under the MapReduce framework, which noticeably improves query speed over large amounts of data. To support online, near-real-time queries on the closed histogram data cube, HBase is used as the storage platform for the histogram cube and the index, and interactive queries are answered using the query key, the query code, and the inverted index. Experiments on the TPC-DS test data set demonstrate the compression of the data cube, the advantage of incremental updating over recomputation, and the improvement in query efficiency over the previous query algorithm and implementation.
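To make the central idea concrete, the following is a minimal single-machine sketch of a closed histogram cube. It is not the thesis's distributed algorithm or its storage layout: the brute-force enumeration, the function names (`closure`, `build_closed_histogram_cube`, `lookup`, `aggregates`), the toy table, and the use of an exact value-frequency `Counter` as the histogram are all assumptions made for illustration. A cell is kept only if it is closed, meaning no more specific cell covers exactly the same rows, and each kept cell stores a histogram from which several aggregate types can be derived.

```python
from collections import Counter
from itertools import product

STAR = "*"  # wildcard meaning "all values" of a dimension


def matches(cell, dims):
    """True if every non-wildcard position of `cell` equals `dims`."""
    return all(c == STAR or c == d for c, d in zip(cell, dims))


def closure(cell, rows):
    """Return the most specific cell covering the same rows as `cell`."""
    covered = [(dims, m) for dims, m in rows if matches(cell, dims)]
    if not covered:
        return None, covered
    # Fix every dimension on which all covered rows agree.
    closed = tuple(
        column[0] if len(set(column)) == 1 else STAR
        for column in zip(*(dims for dims, _ in covered))
    )
    return closed, covered


def build_closed_histogram_cube(rows):
    """Map each closed cell to a histogram (measure value -> frequency).

    Brute force over all cells: exponential, for illustration only.
    """
    n = len(rows[0][0])
    domains = [sorted({dims[i] for dims, _ in rows}) + [STAR] for i in range(n)]
    cube = {}
    for cell in product(*domains):
        closed, covered = closure(cell, rows)
        if closed is not None and closed not in cube:
            cube[closed] = Counter(m for _, m in covered)
    return cube


def lookup(cube, query_cell):
    """Answer any cell query from the closed cells alone (lossless)."""
    candidates = [c for c in cube if matches(query_cell, c)]
    if not candidates:
        return Counter()
    # The closure of the query is the matching closed cell covering most rows.
    best = max(candidates, key=lambda c: sum(cube[c].values()))
    return cube[best]


def aggregates(hist):
    """Derive several aggregate types from one non-empty histogram."""
    count = sum(hist.values())
    total = sum(value * freq for value, freq in hist.items())
    return {"count": count, "sum": total, "min": min(hist),
            "max": max(hist), "avg": total / count}


# Hypothetical toy fact table: (store, product) -> sales amount.
rows = [(("s1", "p1"), 10), (("s1", "p2"), 20), (("s2", "p1"), 10)]
cube = build_closed_histogram_cube(rows)

# One integer code per closed cell (an assumed, arbitrary numbering).
codes = {cell: code for code, cell in enumerate(sorted(cube))}

print(len(cube), "closed cells")
print(aggregates(lookup(cube, ("s2", STAR))))
```

In this toy table the cell `("s2", "*")` is not stored because it covers the same single row as `("s2", "p1")`, yet `lookup` still answers it from the closed cell, which is what makes the compression lossless. Because each cell holds a histogram rather than a single number, count, sum, min, max, and average all come from the same stored measure.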
Keywords/Search Tags:Closed Data Cube, Histogram Cube, Incremental Updating, Online Query, Compression Storage