Efficient Star Join For Column-Oriented Data Store In The MapReduce Environment

Posted on:2013-06-20

Degree:Master

Type:Thesis

Country:China

Candidate:H T Zhu

Full Text:PDF

GTID:2248330374967147

Subject:Computer software and theory

Abstract/Summary:

With the growth of web applications, data volumes have expanded enormously in fields such as scientific research, electronic commerce, and web applications. The amount of data at large commercial companies such as Wal-Mart and Taobao has reached the petabyte (PB) level. Such massive data needs effective management, efficient analysis, and decision support. However, traditional data warehouse and business intelligence systems no longer meet these requirements. In addition, users have raised new requirements for massive commercial data analysis, including efficiency and scalability. Hence, we study the management of massive commercial data and provide a solution in the Hadoop environment. We discuss the optimization of massive data analysis based on the MapReduce framework and focus on the classical join operation in data warehouses, i.e., the star join. Our solution improves both efficiency and scalability through the HdBmp index and data placement. Extensive experiments show the efficiency and effectiveness of the proposed join algorithm.

The contributions of this thesis are as follows:

· HDFS-based layout for data and HdBmp index. We propose a commercial data placement approach based on the HdBmp index on Hadoop. First, data files are partitioned in a column-oriented manner to obtain a good placement in HDFS, which provides the basis for the subsequent data analysis. Then, a novel index called the HdBmp index is used. It is a non-embedded, data-independent index whose ultimate goal is to improve the efficiency of massive data analysis. In addition, the HdBmp index supports index creation and update in a distributed manner, which makes our system scalable and robust.

· HdBmp join. We propose a MapReduce join algorithm based on the HdBmp index, i.e., the HdBmp join algorithm. Naive MapReduce join methods transmit too much useless data during join processing, which places a high load on network bandwidth. For the star join, we first propose an improved join algorithm that reduces the number of MapReduce jobs. Further, we propose the HdBmp join algorithm, based on the HdBmp index, which follows the join plan to filter out most of the useless tuples and so makes the star join much more efficient. The experiments show that the HdBmp join algorithm has a significant advantage in massive data processing and analysis.

· Extensive experimental comparisons. The data set used in our experiments is generated by the TPC-H benchmark data generation tool. Our tests cover all aspects of the HdBmp join and compare it with the improved star join algorithm, i.e., the IM algorithm. Experimental results show that the HdBmp join algorithm outperforms the IM algorithm in both performance and scalability. We also discuss how to improve the HdBmp join algorithm as a reference for future work.
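
The abstract does not give the internals of the HdBmp index, so the following is only a minimal single-machine sketch of the general idea it describes: using bitmaps over a column-oriented fact table to discard non-matching tuples before the star join materializes any rows. The table contents, the column names, and the helpers `build_bitmap` and `star_join` are hypothetical and chosen for illustration; this is not the thesis's algorithm and it omits MapReduce and HDFS entirely.

```python
# Toy column-oriented fact table: one list per column (hypothetical data).
fact = {
    "cust_key": [1, 2, 1, 3, 2],
    "part_key": [10, 10, 20, 20, 30],
    "revenue": [100, 250, 75, 300, 50],
}

# Toy dimension tables keyed by primary key (hypothetical data).
customer = {1: {"region": "ASIA"}, 2: {"region": "EUROPE"}, 3: {"region": "ASIA"}}
part = {10: {"brand": "A"}, 20: {"brand": "B"}, 30: {"brand": "A"}}


def build_bitmap(fk_column, dim, predicate):
    """Return an int used as a bitmap: bit i is set when fact row i
    references a dimension row that satisfies the predicate."""
    bitmap = 0
    for i, key in enumerate(fk_column):
        if key in dim and predicate(dim[key]):
            bitmap |= 1 << i
    return bitmap


def star_join(fact, filters, measure):
    """AND one bitmap per dimension filter, then join only surviving rows."""
    n_rows = len(fact[measure])
    survivors = (1 << n_rows) - 1  # start with every fact row selected
    for fk_name, dim, predicate in filters:
        survivors &= build_bitmap(fact[fk_name], dim, predicate)

    result = []
    for i in range(n_rows):
        if (survivors >> i) & 1:
            row = {measure: fact[measure][i]}
            # Dimension attributes are fetched only for surviving rows.
            for fk_name, dim, _ in filters:
                row.update(dim[fact[fk_name][i]])
            result.append(row)
    return result


filters = [
    ("cust_key", customer, lambda d: d["region"] == "ASIA"),
    ("part_key", part, lambda d: d["brand"] == "B"),
]
rows = star_join(fact, filters, "revenue")
```

With this toy data, only the two fact rows that reference both an ASIA customer and a brand-B part survive the bitmap intersection, so the dimension lookups run for two rows instead of five. In a distributed setting the same filtering would be what keeps useless tuples off the network.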

Keywords/Search Tags:

Distributed system, Star join, Column store, HdBmp index, HdBmp join

Related items

1	Research Of Key Technology Of Index In Column-Oriented DWMS
2	Research And Optimization Of Multidimensional Data Warehouse Model Based On Column Storage
3	Research And Implementation Of Key Techniques For Query Rewriting In Column-Store Data Warehouse
4	Research On Distributed Memory Column Store Engine
5	Column Store Database---A New Approach to GIS Application
6	The Optimization Of The Query Execution Engine In Column Oriented DWMS
7	Research And Implementation Of Parallel Query Processing In Column-store
8	Research And Implementation Of Query Optimizing Of Column Store In Data Warehouse Management System
9	Research And Implementation Of The Bitmap Index In Column-Oriented Data Warehouse
10	The Design And Implementation Of An E-commerce System Based On Column-store And SAP TREX