
Research on Handling Data Skew in the MapReduce Computing Model

Posted on: 2016-08-17    Degree: Master    Type: Thesis
Country: China    Candidate: Y F Gao    Full Text: PDF
GTID: 2308330461450899    Subject: Computer application technology
Abstract/Summary:
With the arrival of the era of "big data", many applications involving massive data processing have emerged. Owing to limitations in scalability, high availability, and fault tolerance, traditional distributed databases, parallel data management and processing technologies, and data warehouse systems can no longer cope with the storage and processing of massive data.

Large-scale data analysis and processing on cloud computing platforms requires the support of a data-intensive computing model. MapReduce, the data-intensive computing model first proposed by Google, is mainly used for processing and analyzing large data sets. It makes full use of distributed computing and storage resources: data processing tasks and computing tasks are assigned to thousands of inexpensive physical nodes, providing massive storage capacity and computing power.

However, load imbalance often arises during task execution in the MapReduce model, degrading job efficiency. Skew in the input data of Map and Reduce tasks causes some sub-tasks to run slowly and seriously degrades MapReduce performance. In addition, during data joins, data skew occurs when some key values appear far more frequently than others.

To address the data skew problem in the MapReduce computing model, this paper presents a data processing algorithm named HVBR-SH (Hash Virtual Balance Repartitioning based Skew Handling). In the Map phase, a virtual partitioning method is applied so that the <Key, Value> pairs are stored discretely, providing more combination options for the subsequent repartitioning. In the Reduce phase, a balance repartitioning method is applied to the contiguous virtual partitions: the virtual partitions collected from the Map phase are regrouped into as many new partitions as there are Reduce tasks, such that the largest resulting partition is as small as possible. This shortens the running time of the whole Reduce phase. Experimental results show that HVBR-SH effectively balances the input data sizes of the Reduce tasks and controls their running time; it can therefore handle data skew in MapReduce and improve the efficiency of running MapReduce jobs.

To optimize massive data joins over MapReduce-based large clusters, the paper further proposes a MapReduce join processing mechanism based on pre-hashing and indexing. It first hashes the output data of the Map tasks and builds an index of the <Key, Value> pairs. Based on the index information, it then estimates the time complexity of joining the pairs that share the same key. Finally, according to this estimate, the system assigns the corresponding amounts of data to the Reducer nodes in a load-balanced manner. Experimental results show that the proposed join processing mechanism effectively improves the load balance of the Reducer nodes and significantly improves join processing efficiency over large clusters.
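The balance repartitioning step can be illustrated with a small sketch: given the sizes of the contiguous virtual partitions produced in the Map phase, group them into as many new partitions as there are Reduce tasks while minimizing the largest group. The function name, the binary-search heuristic, and the example sizes below are assumptions for illustration only, not the thesis's exact procedure.

```python
# Sketch: group contiguous virtual partitions into num_reducers groups,
# minimizing the size of the largest group (min-max contiguous partitioning).

def min_max_partition(virtual_sizes, num_reducers):
    """Return (smallest achievable maximum group size, cut indices)."""

    def fits(limit):
        # Can all virtual partitions be packed, in order, into at most
        # num_reducers groups whose sizes never exceed `limit`?
        groups, current = 1, 0
        for size in virtual_sizes:
            if size > limit:
                return False
            if current + size > limit:
                groups += 1
                current = size
            else:
                current += size
        return groups <= num_reducers

    # Binary-search the smallest feasible maximum group size.
    lo, hi = max(virtual_sizes), sum(virtual_sizes)
    while lo < hi:
        mid = (lo + hi) // 2
        if fits(mid):
            hi = mid
        else:
            lo = mid + 1

    # Rebuild the cut points for the chosen limit.
    cuts, current = [], 0
    for i, size in enumerate(virtual_sizes):
        if current + size > lo:
            cuts.append(i)
            current = size
        else:
            current += size
    return lo, cuts


# Example: 8 skewed virtual partitions regrouped for 3 Reduce tasks.
sizes = [50, 3, 7, 12, 40, 5, 9, 30]
print(min_max_partition(sizes, 3))  # -> (59, [2, 5])
```

With the skewed sizes above, the largest Reduce input drops from 102 (a naive split of 8 virtual partitions into 3, 3, 2) to 59, which is the kind of balancing effect HVBR-SH aims for.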
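The index-driven join balancing can likewise be sketched at a high level: per-key record counts from the two join inputs stand in for the index, the product of the counts stands in for the per-key join cost, and keys are assigned greedily to the least-loaded Reducer. The cost model and the greedy assignment are assumptions for illustration; the thesis's mechanism may compute and use the index differently.

```python
# Sketch: assign join keys to Reducers so that estimated join cost is balanced.
import heapq
from collections import Counter

def assign_join_keys(left_keys, right_keys, num_reducers):
    """Return a mapping key -> reducer index balancing estimated join cost."""
    left_count = Counter(left_keys)
    right_count = Counter(right_keys)

    # Estimated cost of joining one key: left occurrences * right occurrences.
    costs = {k: left_count[k] * right_count[k]
             for k in left_count.keys() & right_count.keys()}

    # Greedy: heaviest keys first, each onto the currently least-loaded reducer.
    heap = [(0, r) for r in range(num_reducers)]  # (load, reducer id)
    heapq.heapify(heap)
    assignment = {}
    for key, cost in sorted(costs.items(), key=lambda kv: -kv[1]):
        load, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        heapq.heappush(heap, (load + cost, reducer))
    return assignment


# Example: key 'a' is heavily skewed on both join inputs.
left = ['a'] * 100 + ['b'] * 10 + ['c'] * 5
right = ['a'] * 50 + ['b'] * 20 + ['c'] * 5
print(assign_join_keys(left, right, 2))  # e.g. {'a': 0, 'b': 1, 'c': 1}
```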
Keywords/Search Tags: MapReduce, Data skew, Virtual partitioning, Data join