Join Processing And Optimizing On Large Data Sets Based On Hadoop Framework

Posted on:2014-01-10

Degree:Master

Type:Thesis

Country:China

Candidate:H Sun

Full Text:PDF

GTID:2248330395484011

Subject:Computer application technology

Abstract/Summary:

Data analysis is an important function of cloud computing, where huge amounts of data are processed over very large clusters. MapReduce is a popular way to handle data in cloud environments because of its excellent scalability and good fault tolerance. However, because of its own limitations, MapReduce performs poorly on complex data analysis tasks that require joining data sets in order to compute certain aggregates.

First, after analyzing the shortcomings of RSJ, a general two-way join algorithm, we propose an optimization based on DistributedCache. The idea is to preprocess the data before running RSJ. Preprocessing extracts the join attribute values from one of the join tables, compresses them with a bitmap into a small file of "background" data, and then sends this small file to all nodes through the DistributedCache mechanism. RSJ can then use the "background" data in the Map stage to filter out much of the data in the other table that does not satisfy the join condition. This reduces the mappers' output and thereby achieves the optimization.

Second, because of its one-to-one shuffling scheme, MapReduce must divide a multiway join into a sequence of subtasks, which frequently checkpoint and shuffle intermediate results and so introduce a huge I/O overhead. Here, a new one-to-many shuffling strategy is used. With this strategy, a multiway join can be completed in a single MapReduce job.

Finally, to verify the two optimizations, we ran extensive experiments on the Hadoop platform. The experimental results show that both methods effectively improve join performance under the MapReduce framework.
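The bitmap pre-filtering idea can be illustrated with a small single-process sketch. This is not the thesis implementation: it is plain Python that imitates the map and reduce stages in memory, it assumes RSJ denotes the usual reduce-side join, it assumes integer join keys mapped to bit positions with a simple modulo, and all function names (`build_bitmap`, `might_contain`, `map_phase`, `reduce_phase`, `filtered_join`) are hypothetical. In a real Hadoop job the bitmap would be written to a file and shipped to every node via DistributedCache rather than passed as an argument.

```python
from collections import defaultdict


def build_bitmap(keys, num_bits):
    """Preprocessing: set one bit per join key of the smaller table.

    Assumption: keys are integers and the bit position is key % num_bits,
    so different keys may share a bit (false positives are possible).
    """
    bitmap = bytearray((num_bits + 7) // 8)
    for key in keys:
        pos = key % num_bits
        bitmap[pos // 8] |= 1 << (pos % 8)
    return bitmap


def might_contain(bitmap, num_bits, key):
    """Return False only if the key is definitely absent from the small table."""
    pos = key % num_bits
    return bool(bitmap[pos // 8] & (1 << (pos % 8)))


def map_phase(small, large, bitmap, num_bits):
    """Tag each record with its source table; drop large-table records
    whose key cannot match according to the bitmap."""
    emitted = []
    for key, value in small:
        emitted.append((key, ("S", value)))
    for key, value in large:
        if might_contain(bitmap, num_bits, key):
            emitted.append((key, ("L", value)))
    return emitted


def reduce_phase(emitted):
    """Group by key (standing in for the shuffle) and join the two sides."""
    groups = defaultdict(lambda: {"S": [], "L": []})
    for key, (tag, value) in emitted:
        groups[key][tag].append(value)
    joined = []
    for key in sorted(groups):
        for s_value in groups[key]["S"]:
            for l_value in groups[key]["L"]:
                joined.append((key, s_value, l_value))
    return joined


def filtered_join(small, large, num_bits=64):
    """Run the sketch end to end; also return the number of map outputs."""
    bitmap = build_bitmap((key for key, _ in small), num_bits)
    emitted = map_phase(small, large, bitmap, num_bits)
    return reduce_phase(emitted), len(emitted)
```

Because several keys can share one bit, the filter may let through records that do not actually join; the reduce stage still discards them, so the filter only reduces the volume of map output and never changes the join result.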

Keywords/Search Tags:

Join, Cloud Computing, MapReduce, Hadoop, HDFS, Bit-map, DistributedCache, Shuffling Strategy

Related items

1	Optimum Design Of Table Join Algorithm Based On MapReduce
2	The Design Of The Cloud Computing System Based On Hadoop
3	Optimization And Application Research Of MapReduce Computing Model Based On Hadoop
4	The Cloud Computing Based On Hadoop Platform And Log Analysis
5	Research On The Application Of Cloud Computing Based On Hadoop
6	Working Principle And Applied Research Of MapReduce
7	MapReduce Performance Research And Optimization Based On Block Aggregation
8	Research And Application Of The Characteristics Of Distributed Computing Of OSS/BSS In The Cloud Deployment
9	Join Method Research Based On MapReduce
10	The Performance Optimization And Improvement Of MapReduce In Hadoop