Efficient Star Join For Column-Oriented Data Store In The MAP Reduce Environment

Posted on: 2013-06-20 | Degree: Master | Type: Thesis
Country: China | Candidate: H T Zhu | Full Text: PDF
GTID: 2248330374967147 | Subject: Computer software and theory
Abstract/Summary:
With the growth of web applications, data volumes have expanded enormously in fields such as scientific research and electronic commerce. The amount of data held by large commercial companies such as Wal-Mart and Taobao has reached the petabyte (PB) level. Such massive data needs effective management, efficient analysis, and decision support, but traditional data warehouse and business intelligence systems no longer meet these requirements. In addition, users now demand efficiency and scalability in massive commercial data analysis. We therefore study massive commercial data management and provide a solution in the Hadoop environment. We discuss the optimization of massive data analysis on the MapReduce framework and focus on the classical join operation in data warehouses, the star join. Our solution improves both efficiency and scalability through the HdBmp index and data placement. Extensive experiments demonstrate the efficiency and effectiveness of the proposed join algorithm.

The contributions of this thesis are as follows:

· HDFS-based layout for data and HdBmp index. We propose a commercial data placement approach based on the HdBmp index on Hadoop. First, data files are partitioned in a column-oriented manner to obtain a good placement in HDFS, which provides the basis for the subsequent data analysis. Then a novel index, the HdBmp index, is applied. It is a non-embedded, data-independent index whose goal is to improve the efficiency of massive data analysis. The HdBmp index also supports index creation and update in a distributed manner, which makes our system scalable and robust.

· HdBmp join. We propose a MapReduce join algorithm based on the HdBmp index, the HdBmp join algorithm. Naive MapReduce join methods transmit too much useless data during join processing, which places a high load on network bandwidth. For star join, we first propose an improved join algorithm that reduces the number of MapReduce jobs. We then propose the HdBmp join algorithm, based on the HdBmp index, which uses the join plan to filter out most tuples that are useless to the join result, making star join much more efficient. Experiments show that the HdBmp join algorithm has a significant advantage in massive data processing and analysis.

· Extensive experimental comparisons. The data set used in our experiments is generated by the TPC-H benchmark data generation tool. Our tests cover all aspects of HdBmp join and compare it with the improved star join algorithm, i.e., the IM algorithm. Experimental results show that the HdBmp join algorithm outperforms the IM algorithm in both performance and scalability. We also discuss how to improve the HdBmp join algorithm as a reference for future work.
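
The abstract does not describe the internal structure of the HdBmp index, so the following is only a minimal single-process sketch of the general idea it builds on: a bitmap join index over a column-oriented fact table that discards non-matching fact rows before any measure values are read. It assumes a generic bitmap index rather than the thesis's actual HdBmp design, and it involves no MapReduce or HDFS. All table names, column names, and functions here are hypothetical.

```python
from collections import defaultdict

# Hypothetical column-oriented fact table: one list per column.
fact = {
    "cust_key": [1, 2, 1, 3],
    "part_key": [10, 10, 20, 20],
    "revenue": [100, 200, 300, 400],
}
# Hypothetical dimension tables.
customers = [
    {"key": 1, "region": "ASIA"},
    {"key": 2, "region": "EUROPE"},
    {"key": 3, "region": "ASIA"},
]
parts = [{"key": 10, "brand": "B1"}, {"key": 20, "brand": "B2"}]


def build_bitmap_index(fk_column, dim_rows, attr):
    """Map each dimension attribute value to a bitmap over fact rows.

    Bit i is set when fact row i references a dimension row whose
    attribute equals that value. Python ints serve as bitmaps.
    """
    key_to_value = {row["key"]: row[attr] for row in dim_rows}
    bitmaps = defaultdict(int)
    for i, fk in enumerate(fk_column):
        bitmaps[key_to_value[fk]] |= 1 << i
    return dict(bitmaps)


def set_bit_positions(bitmap):
    """Return the positions of the set bits, lowest first."""
    positions, i = [], 0
    while bitmap:
        if bitmap & 1:
            positions.append(i)
        bitmap >>= 1
        i += 1
    return positions


def star_join_sum(fact, indexes, predicates, measure):
    """AND the bitmaps of all dimension predicates, then read only the
    surviving rows of the measure column."""
    selected = (1 << len(fact[measure])) - 1  # start with every row
    for attr, value in predicates.items():
        selected &= indexes[attr].get(value, 0)
    rows = set_bit_positions(selected)
    return rows, sum(fact[measure][i] for i in rows)


indexes = {
    "region": build_bitmap_index(fact["cust_key"], customers, "region"),
    "brand": build_bitmap_index(fact["part_key"], parts, "brand"),
}
rows, total = star_join_sum(
    fact, indexes, {"region": "ASIA", "brand": "B2"}, "revenue"
)
print(rows, total)  # [2, 3] 700
```

In this sketch the dimension tables are consulted only when the index is built. At query time the fact rows are selected with bitwise ANDs, which illustrates why an index of this kind can reduce the amount of useless data that a join has to move.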
Keywords/Search Tags: Distributed system, Star join, Column store, HdBmp index, HdBmp join