Research On Distributed OLAP Query Optimization Based On Hive

Posted on: 2018-03-21
Degree: Master
Type: Thesis
Country: China
Candidate: X Zhang
Full Text: PDF
GTID: 2428330596954768
Subject: Computer Science and Technology
Abstract/Summary:
With fierce market competition, it is particularly important for enterprises to seize the initiative by performing OLAP analysis quickly. In the big data era, distributed database architectures provide the foundation for analysis services over massive data, but executing OLAP queries on them introduces considerable latency when processing joins, grouping, and other complex queries. It is therefore necessary to optimize distributed OLAP queries. This thesis takes Hive, a distributed data analysis platform widely used in business, as a case study. It analyzes the existing problems and goals of distributed OLAP query optimization and then proposes an optimization scheme that introduces precomputation into Hive to generate cubes, which are stored in HBase, so that Hive's OLAP queries are converted into HBase cube queries. The thesis also analyzes the technologies the scheme relies on, studies each of them, and implements them. The main work is as follows:

(1) To address the low efficiency of data cube optimization for OLAP queries, the thesis studies two cube algorithms under the MapReduce programming model: the layer cube algorithm and the segmented cube algorithm. Designing both algorithms and comparing them shows that the segmented cube algorithm computes the cube faster. An improved algorithm based on the segmented cube algorithm is then proposed to accelerate distributed cube computation. Finally, a simulation experiment is carried out, and the results show that the improved segmented cube algorithm is more efficient.

(2) To address the low efficiency of data cube optimization for OLAP queries, the thesis proposes and designs a dimension rowkey storage model for the HBase cube, which realizes distributed storage of the OLAP cube according to its dimensions. First, it studies techniques for loading the OLAP cube into HBase so that large data volumes can be loaded rapidly. Second, it studies OLAP cube update techniques, which keep the original data and the cube data consistent. Third, it studies compressed storage of the OLAP cube and three data compression algorithms, Gzip, LZO, and Snappy, to reduce the storage space of the HBase cube. Finally, the thesis designs simulation experiments on cube loading and on the compression algorithms. The results show that the Bulk Load method loads data faster and that Snappy is more suitable for cube compression, which improves OLAP query performance.

(3) Building on the above research, an OLAP query mode is designed to support queries and range queries on the cube. The thesis proposes an online query strategy for the update phase and designs an online aggregation algorithm so that queries are not interrupted during updates and remain fast. A verification experiment on cube queries is conducted and shows that the cubes generated by the computation and storage strategies proposed in the thesis can optimize OLAP queries. The online query strategy also speeds up queries issued while the cube is being updated.

In summary, the optimization scheme effectively improves the efficiency of OLAP queries.
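As a rough illustration of the precompute-then-look-up idea described in the abstract, the sketch below builds a full cube in memory and answers a query by looking up a dimension-ordered rowkey. It is not the thesis's segmented cube algorithm or its HBase storage model: the dimension names, the `|` separator, the `*` wildcard, and the functions `rowkey`, `build_cube`, and `query` are all assumptions made for this example, and a plain dictionary stands in for the HBase table.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical dimensions and wildcard marker (not taken from the thesis).
DIMENSIONS = ("region", "product", "year")
WILDCARD = "*"  # marks a dimension that has been aggregated away

# Hypothetical fact rows; "sales" is the measure being aggregated.
SAMPLE_ROWS = [
    {"region": "east", "product": "a", "year": 2017, "sales": 10.0},
    {"region": "east", "product": "b", "year": 2017, "sales": 5.0},
    {"region": "west", "product": "a", "year": 2018, "sales": 7.0},
]


def rowkey(fixed):
    """Build a key with one slot per dimension, always in the same order."""
    return "|".join(str(fixed.get(d, WILDCARD)) for d in DIMENSIONS)


def build_cube(rows, measure="sales"):
    """Sum the measure for every subset of dimensions (every cuboid)."""
    cube = defaultdict(float)
    for row in rows:
        for k in range(len(DIMENSIONS) + 1):
            for kept in combinations(DIMENSIONS, k):
                key = rowkey({d: row[d] for d in kept})
                cube[key] += row[measure]
    return dict(cube)


def query(cube, **fixed):
    """Answer an aggregate query with a single key lookup, no scan of rows."""
    return cube.get(rowkey(fixed), 0.0)


if __name__ == "__main__":
    cube = build_cube(SAMPLE_ROWS)
    print(rowkey({"region": "east"}))
    print(query(cube, region="east"))
```

The point of the sketch is the trade-off the abstract relies on: grouping work is paid once when the cube is built, so a later aggregate query becomes a key lookup. A distributed version would split the cuboid computation across MapReduce tasks and would have to keep the cube consistent with the source data on update, which this example does not attempt.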
Keywords/Search Tags: OLAP query optimization, segmented cube algorithm, dimension rowkey storage, online query strategy