
Join Processing And Optimizing On Large Data Sets Based On Hadoop Framework

Posted on: 2014-01-10
Degree: Master
Type: Thesis
Country: China
Candidate: H Sun
GTID: 2248330395484011
Subject: Computer application technology
Abstract/Summary:
Data analysis is an important function of cloud computing, which allows huge amounts of data to be processed over very large clusters. MapReduce is a popular way to handle data in cloud environments because of its excellent scalability and good fault tolerance. However, because of its own limitations, MapReduce performs poorly on complex data analysis tasks that require joining data sets in order to compute certain aggregates.

First, after analyzing the shortcomings of RSJ, a general two-way join algorithm, we propose an optimization algorithm based on DistributedCache. The idea is to preprocess the data before running RSJ. Preprocessing extracts the join attribute values from one of the join tables, compresses them with a bit-map into smaller "background" data stored in a small file, and then transmits this small file to all nodes through the DistributedCache mechanism. When RSJ then runs, the Map stage can use the "background" data to filter out much of the data in the other table that does not satisfy the join condition. This reduces the mapper output and thereby achieves the optimization.

Next, because of its one-to-one shuffling scheme, MapReduce has to divide a multiway join task into a sequence of subtasks, which frequently checkpoints and shuffles intermediate results and so introduces a huge I/O overhead. Here, we use a new one-to-many shuffling strategy. With this strategy, a multiway join task can be completed with only one MapReduce task.

Finally, to verify the two optimization algorithms, we conducted extensive experiments on the Hadoop platform. The experimental results show that both optimization methods can efficiently improve join performance under the MapReduce framework.
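The bit-map filtering step described in the abstract can be illustrated with a small in-memory sketch. This is not the thesis's implementation and uses no Hadoop API: the function names (`build_bitmap`, `may_contain`, `map_phase`, `reduce_phase`), the 64-bit bitmap size, and the sample rows are all assumptions made for illustration. The bitmap stands in for the small "background" file that would be shipped to every node, and the map phase drops rows from the second table whose join key cannot match.

```python
from collections import defaultdict

NUM_BITS = 64  # assumed bitmap size; a real job would size this to the key range


def build_bitmap(keys, num_bits=NUM_BITS):
    """Set one bit per join-key value, hashed into num_bits slots."""
    bitmap = bytearray((num_bits + 7) // 8)
    for key in keys:
        slot = hash(key) % num_bits
        bitmap[slot // 8] |= 1 << (slot % 8)
    return bitmap


def may_contain(bitmap, key, num_bits=NUM_BITS):
    """True if the key's bit is set. False positives are possible, false negatives are not."""
    slot = hash(key) % num_bits
    return bool(bitmap[slot // 8] & (1 << (slot % 8)))


def map_phase(left, right, bitmap):
    """Emit (key, (tag, row)) pairs; right-table rows failing the bitmap test are dropped."""
    emitted = [(key, ("L", row)) for key, row in left]
    for key, row in right:
        if may_contain(bitmap, key):
            emitted.append((key, ("R", row)))
    return emitted


def reduce_phase(emitted):
    """Group by key and pair every left row with every right row for that key."""
    groups = defaultdict(lambda: {"L": [], "R": []})
    for key, (tag, row) in emitted:
        groups[key][tag].append(row)
    joined = []
    for key in sorted(groups):
        for left_row in groups[key]["L"]:
            for right_row in groups[key]["R"]:
                joined.append((key, left_row, right_row))
    return joined


# Hypothetical sample tables: (join key, payload)
left = [(1, "alice"), (2, "bob"), (3, "carol")]
right = [(2, "order-a"), (3, "order-b"), (4, "order-c"),
         (5, "order-d"), (66, "order-e")]

bitmap = build_bitmap(key for key, _ in left)
emitted = map_phase(left, right, bitmap)
joined = reduce_phase(emitted)
```

In this sample, keys 4 and 5 are filtered out before the shuffle, so fewer records leave the mappers. Key 66 collides with key 2 in the 64-slot bitmap and passes the filter, but it finds no partner in the reduce phase, so the join result stays correct; the filter only ever removes rows that cannot match.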
Keywords/Search Tags: Join, Cloud Computing, MapReduce, Hadoop, HDFS, Bit-map, DistributedCache, Shuffling Strategy