
Research On Key Technologies Of Resource Scheduling In MapReduce

Posted on: 2016-07-07
Degree: Doctor
Type: Dissertation
Country: China
Candidate: B Wang
Full Text: PDF
GTID: 1318330536950227
Subject: Computer Science and Technology

Abstract/Summary:
The era of big data presents many challenges and requires new ways to store, manage, access, and process the colossal amount of available data. MapReduce is a programming model and an associated implementation for processing large data sets in parallel on clusters with hundreds or thousands of nodes. Due to its scalability and ease of programming, MapReduce has been adopted by many companies and optimized by many researchers, and this large body of research has made it increasingly efficient. However, as hardware and software technologies develop, new hardware platforms and new kinds of clusters emerge, which challenges the performance of MapReduce: when MapReduce runs on heterogeneous clusters or many-core clusters, or executes high-performance scientific applications, its performance degrades significantly. To solve these problems, this dissertation optimizes the performance of MapReduce when running on heterogeeous clusters, with skewed input data, and on many-core clusters. The main contributions and innovations are as follows:

1) This dissertation proposes ActCap, a solution that uses a Markov-chain-based model to perform node-capability-aware data placement for continuously incoming data. ActCap uses a two-state Markov chain to describe each node's behavior and devises a new algorithm to predict the nodes' data-processing capabilities and place data accordingly. The experimental results show that ActCap reduces the percentage of inter-node data transfer from 32.9% to 7.7% and gains an average speedup of 49.8% over Hadoop, and achieves an average speedup of 9.8% over Tarazu, the latest related work.

2) This dissertation proposes Skew--, a coordinated, systematic solution that synthesizes various optimization techniques.
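The two-state Markov-chain capability model could be sketched as follows. The FAST/SLOW state names, the throughput threshold, and the steady-state weighting are illustrative assumptions, not the dissertation's actual ActCap algorithm.

```python
# A minimal sketch of node-capability prediction with a two-state Markov
# chain, in the spirit of ActCap.  States, threshold, and the weighting
# scheme are assumptions for illustration only.

FAST, SLOW = 0, 1

def estimate_transitions(states):
    """Estimate a 2x2 transition matrix from an observed state sequence
    (with add-one smoothing so all probabilities stay positive)."""
    counts = [[1.0, 1.0], [1.0, 1.0]]
    for prev, cur in zip(states, states[1:]):
        counts[prev][cur] += 1.0
    return [[c / sum(row) for c in row] for row in counts]

def steady_state_fast(P):
    """Stationary probability of FAST for a two-state chain:
    pi_fast = p(SLOW->FAST) / (p(FAST->SLOW) + p(SLOW->FAST))."""
    return P[SLOW][FAST] / (P[FAST][SLOW] + P[SLOW][FAST])

def placement_weights(node_histories, threshold):
    """Turn each node's recent throughput samples into a normalized
    data-placement weight proportional to its predicted probability
    of being in the FAST state."""
    weights = {}
    for node, history in node_histories.items():
        states = [FAST if t >= threshold else SLOW for t in history]
        weights[node] = steady_state_fast(estimate_transitions(states))
    total = sum(weights.values())
    return {node: w / total for node, w in weights.items()}
```

A node whose recent throughput mostly exceeds the threshold ends up with a higher predicted FAST probability and therefore receives a larger share of the incoming data.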
The key contributions of Skew-- are: 1) it devises a new sampling-based automatic approach to determine the complexity of Reduce tasks, and presents Complexity-Aware Keys Assignment, a post-Map key allocation scheme that takes into account not only the number of keys but also the Reduce task complexity and the key group size to balance the load among Reducers; and 2) it proposes Locality-Aware Reducers Selection, Full Mappers Execution, and Shuffle-Type Identification, three mechanisms that guarantee the benefit of Complexity-Aware Keys Assignment through enhanced data locality and more efficient resource scheduling. Experiments on a 7-node cluster with 13 benchmarks show that Skew-- achieves average speedups of 1.98x, 1.63x, and 1.77x over YARN, SkewTune, and Online Balancer respectively. In addition, it achieves an average speedup of 1.41x in the Reduce phase over TopCluster, the latest work that is most similar to ours.

3) This dissertation proposes mpCache, a new approach that caches both Input Data and Localized Data to speed up the I/O-intensive phases. An algorithm dynamically tunes the allocation between the Input Cache and the Localized Cache to make full use of the cache and provide better performance. We also propose an algorithm to replace the Input Cache efficiently, taking replacement cost, data-set size, access frequency, and the all-or-nothing property into consideration. The experimental results show that mpCache gains an average speedup of 2.09x over the original Hadoop and achieves an average speedup of 1.68x over PACMan.
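A complexity-aware key assignment could be approximated by a greedy longest-processing-time scheme like the sketch below. The load estimate `size ** complexity` and the function names are assumptions standing in for the dissertation's sampled complexity model, not the actual Skew-- implementation.

```python
import heapq

def assign_keys(key_group_sizes, num_reducers, complexity=1.0):
    """Greedily assign each key group to the currently least-loaded
    Reducer, placing the heaviest groups first (LPT-style).  A group's
    load is estimated as size ** complexity, where the exponent stands
    in for the Reduce-task complexity learned by sampling."""
    heap = [(0.0, r) for r in range(num_reducers)]  # (load, reducer id)
    heapq.heapify(heap)
    assignment = {}
    for key, size in sorted(key_group_sizes.items(), key=lambda kv: -kv[1]):
        load, r = heapq.heappop(heap)
        assignment[key] = r
        heapq.heappush(heap, (load + size ** complexity, r))
    return assignment
```

With `complexity > 1`, a few very large key groups dominate the load estimate, which is exactly the skew situation where counting keys alone would mislead the scheduler.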
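A cache-replacement policy honoring those four factors might score each cached data set and evict whole sets (all-or-nothing) in ascending order of utility, as in this sketch. The utility formula `freq * cost / size` is a guessed stand-in, not mpCache's actual metric.

```python
def choose_victims(cached, blocks_needed):
    """cached maps a data-set name to its size in blocks, access
    frequency, and re-fetch cost.  Because of the all-or-nothing
    property, whole data sets are evicted, lowest utility first,
    until enough blocks are freed."""
    def utility(meta):
        # Rarely used, cheap-to-refetch, large data sets are evicted first.
        return meta["freq"] * meta["cost"] / meta["size"]

    victims, freed = [], 0
    for name, meta in sorted(cached.items(), key=lambda kv: utility(kv[1])):
        if freed >= blocks_needed:
            break
        victims.append(name)
        freed += meta["size"]
    return victims
```

Evicting entire data sets rather than individual blocks reflects the all-or-nothing observation: a partially cached input gives a MapReduce job little benefit, since the slowest (uncached) splits dominate the phase's completion time.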
Keywords/Search Tags:MapReduce, Heterogeneous Clusters, Skew Data, Many-Core Clusters, Markov