
Join Processing And Optimizing On Large Clusters

Posted on: 2012-01-25
Degree: Master
Type: Thesis
Country: China
Candidate: D X Chang
Full Text: PDF
GTID: 2178330335964780
Subject: Computer software and theory

Abstract/Summary:
The rapid increase in data types in modern applications has led to exponential growth in data volume. As data volumes expand, processing requests have also become more complicated. In Web applications, tasks that search and analyze large volumes of data have gradually become common. Traditional data processing technologies, whether centralized or distributed, cannot handle such requirements efficiently, because large-scale data processing exceeds the capabilities of traditional relational databases.

In contrast, large-scale clusters are increasingly used for data-intensive computing, owing to three features. First, they are scalable: a cluster can add or remove nodes according to the user's requirements. Second, they provide fault tolerance: clusters usually keep three replicas of each piece of data, so when the original data becomes unavailable, the system terminates the operations on the current node and continues them on a node holding a replica. Third, they offer high availability: if a job is interrupted by a hardware or software failure, the cluster resumes the job on another node.

Based on these advantages of large-scale clusters, we carried out research on join processing over clusters. Join is one of the classic operations in databases: it extracts information from multiple tables that share a common attribute, so join algorithms play an important role in a wide range of applications.

The contributions of this paper are as follows (illustrative sketches of the techniques are given after this abstract):

1. We compare the costs of the Map, Shuffle and Reduce phases during join processing and analyze the performance bottleneck. We first implement a naive join algorithm on the Map/Reduce framework over large-scale clusters, then compare the execution costs of the Map, Shuffle and Reduce phases using the experimental results. We find that the massive data transmission during the Shuffle phase is the main factor restricting join performance.

2. We propose a pre-hash method that optimizes the naive join. During pre-hash processing, the input data is reorganized according to the hash value of the common attribute, so tuples with the same hash value are stored together. The pre-hash method improves join efficiency by reducing the number of data transmissions.

3. We present an efficient method called Pre-hash Index Block Star Join (PHIBSJ) that improves star join efficiency. The method creates indexes during pre-hash processing and uses them to filter out unnecessary data, reducing both the amount of data transmitted in the Shuffle phase and the amount of computation in the Reduce phase.

The cost models of the algorithms and the experimental results show that the two optimization methods can effectively improve join performance on the Map/Reduce framework over large-scale clusters.
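To make the naive join of contribution 1 concrete, the following is a minimal sketch of a reduce-side (repartition) join written in plain Python rather than on Hadoop; the relation names R and S, the tuple layout (key, payload), and the function names are illustrative assumptions, not the thesis implementation. It shows why the Shuffle phase dominates: every tuple of both inputs is tagged and re-partitioned by the join key before any matching happens.

```python
# Minimal sketch of a naive reduce-side join in the Map/Reduce model.
# Every tuple crosses the "shuffle" before matching, which is the
# network-heavy step the abstract identifies as the bottleneck.
from collections import defaultdict
from itertools import product

def map_phase(relation_tag, tuples):
    """Emit (join_key, (tag, payload)), tagging each tuple with its source relation."""
    for key, payload in tuples:
        yield key, (relation_tag, payload)

def shuffle_phase(mapped):
    """Group all tagged tuples by join key; on a cluster this moves the full input."""
    groups = defaultdict(list)
    for key, tagged in mapped:
        groups[key].append(tagged)
    return groups

def reduce_phase(groups):
    """For each join key, pair the tuples that came from R with those from S."""
    for key, tagged_tuples in groups.items():
        r_side = [p for tag, p in tagged_tuples if tag == 'R']
        s_side = [p for tag, p in tagged_tuples if tag == 'S']
        for r, s in product(r_side, s_side):
            yield key, r, s

if __name__ == '__main__':
    R = [(1, 'a'), (2, 'b'), (3, 'c')]
    S = [(2, 'x'), (3, 'y'), (4, 'z')]
    mapped = list(map_phase('R', R)) + list(map_phase('S', S))
    print(list(reduce_phase(shuffle_phase(mapped))))
    # [(2, 'b', 'x'), (3, 'c', 'y')]
```

Note that tuples with keys 1 and 4 are shuffled even though they can never find a join partner; this wasted transmission is what the optimizations target.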
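The pre-hash and index-based filtering ideas of contributions 2 and 3 can likewise be sketched under simple assumptions. The bucket count, the set-based per-bucket key index, and all function names below are illustrative; this is not the PHIBSJ implementation described in the thesis, only an interpretation of the abstract: inputs are reorganized into hash buckets ahead of time, and a small index built during pre-hashing filters tuples that cannot match, so less data reaches the Shuffle and Reduce phases.

```python
# Minimal sketch of pre-hash bucketing plus index-based filtering.
from collections import defaultdict

NUM_BUCKETS = 4  # assumed number of hash buckets

def pre_hash(tuples, num_buckets=NUM_BUCKETS):
    """Reorganize a relation into hash buckets on the join key
    and build a key index (a set of keys) per bucket."""
    buckets = defaultdict(list)
    index = defaultdict(set)
    for key, payload in tuples:
        b = hash(key) % num_buckets
        buckets[b].append((key, payload))
        index[b].add(key)
    return buckets, index

def bucket_join(r_buckets, r_index, s_buckets, s_index):
    """Join only co-located buckets, using the indexes to drop tuples
    whose key has no partner on the other side."""
    for b in set(r_buckets) & set(s_buckets):
        common_keys = r_index[b] & s_index[b]       # index-based filtering
        r_side = defaultdict(list)
        for key, payload in r_buckets[b]:
            if key in common_keys:                  # skip non-matching tuples
                r_side[key].append(payload)
        for key, s_payload in s_buckets[b]:
            if key in common_keys:
                for r_payload in r_side[key]:
                    yield key, r_payload, s_payload

if __name__ == '__main__':
    R = [(1, 'a'), (2, 'b'), (3, 'c')]
    S = [(2, 'x'), (3, 'y'), (4, 'z')]
    rb, ri = pre_hash(R)
    sb, si = pre_hash(S)
    print(sorted(bucket_join(rb, ri, sb, si)))
    # [(2, 'b', 'x'), (3, 'c', 'y')]
```

Because only keys present in both indexes of a bucket survive the filter, tuples that would have been shuffled and then discarded in the Reduce phase never leave their bucket, which is the effect the abstract attributes to the pre-hash and PHIBSJ methods.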
Keywords/Search Tags: Join, Query processing, Query optimization, Map/Reduce, Distributed computing