
Algorithm Rebuilding And Performance Optimization Of Mapreduce In Heterogeneous Environment

Posted on: 2015-10-20
Degree: Master
Type: Thesis
Country: China
Candidate: B Zhao
Full Text: PDF
GTID: 2298330434950116
Subject: Software engineering
Abstract/Summary:
Following the personal-computer revolution and the Internet revolution, cloud computing is viewed as the third IT wave. It is changing the technical foundations of the Internet and even the structure of the entire industry, and it will bring fundamental changes to everyday life, modes of production, and business models.

As a core component of Hadoop, an open cloud-computing platform, MapReduce is a distributed, parallel computing model, built around map and reduce functions, for processing and generating large data sets. MapReduce separates business logic from implementation details and provides powerful interfaces for programmers: it shields the underlying execution machinery and thereby greatly reduces the difficulty of distributed, parallel programming. It offers high reliability, scalability, efficiency, and fault tolerance. However, the MapReduce mechanism itself is neither perfect nor fully mature, and its efficiency can be improved further.

By analyzing the principles and performance indicators of MapReduce in heterogeneous environments, this thesis identifies three weaknesses: unreasonable resource scheduling, inefficient data transmission, and poorly tuned system parameters. To improve the efficiency of MapReduce when processing large data sets in a heterogeneous environment, this thesis proposes optimization strategies in three corresponding areas: an adaptive moving-window scheduling algorithm (MWSA), a change of data-transmission protocol (from HTTP to UDT), and optimization of system configuration parameters.
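To make the programming model concrete, the map/shuffle/reduce flow that the abstract describes can be illustrated with a minimal, framework-free word-count sketch in Python (the function names and sample documents are illustrative, not part of the thesis):

```python
from collections import defaultdict

def map_phase(doc):
    # Map: emit a (word, 1) pair for every word in the document.
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values for each key.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big clusters", "data flows"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
```

In a real Hadoop job the map and reduce functions run on many nodes in parallel, and the shuffle step moves data across the network, which is exactly the stage the thesis later targets with the HTTP-to-UDT protocol change.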
The adaptive moving-window scheduling algorithm (MWSA) has the following features:

(1) It schedules by priority, assigning execution time and system resources accordingly.
(2) It accounts for the heterogeneity of the cluster: since tasks perform differently on different nodes, each task is assigned to a node suited to it.
(3) It balances load automatically, dynamically adjusting the number of tasks running on each TaskTracker according to that node's current workload.
(4) It improves the data-locality scheduling algorithm, introducing a data-locality strategy based on node waiting time.
(5) It improves speculative execution and straggler identification; in particular, slow Map nodes and slow Reduce nodes are distinguished among the slow nodes.
(6) It limits the number of tasks in the backup queues to prevent task thrashing.

For the data-transmission protocol, switching from HTTP to UDT reduces the number of connections established during data transfer and resolves the inefficiency of HTTP's congestion-control mechanism over long-distance, high-bandwidth links.

For the system parameters: compressing the outputs of Map tasks reduces the number and size of transferred files and lowers bandwidth costs; reducing the memory demands of Reduce tasks leaves more memory available for buffering Map outputs; adjusting the ratio of Map to Reduce tasks makes task allocation more efficient; and increasing the number of copier threads in the shuffle stage speeds up large data transfers during shuffle.

Finally, this thesis compares the performance of MapReduce before and after these optimizations.
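The core idea of a moving-window scheduler for a heterogeneous cluster (take the highest-priority tasks in the current window and place them on the best-performing, least-loaded nodes) can be sketched as follows. This is a toy illustration, not the thesis's MWSA: the class names, the scoring formula, and the sample cluster are all assumptions made for the example.

```python
import heapq
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    perf: float      # relative performance score (heterogeneous cluster)
    slots: int       # max concurrent tasks; MWSA would adjust this dynamically
    running: int = 0

@dataclass(order=True)
class Task:
    priority: int                    # lower number = higher priority
    tid: str = field(compare=False)

def schedule_window(tasks, nodes, window=4):
    """Assign up to `window` highest-priority tasks to nodes, preferring
    high-performance nodes with low relative load (illustrative scoring)."""
    heap = list(tasks)
    heapq.heapify(heap)              # min-heap ordered by priority
    assignments = []
    for _ in range(min(window, len(heap))):
        task = heapq.heappop(heap)
        candidates = [n for n in nodes if n.running < n.slots]
        if not candidates:           # no free slot anywhere: put the task back
            heapq.heappush(heap, task)
            break
        # Score each node by performance discounted by its current load.
        best = max(candidates, key=lambda n: n.perf * (1 - n.running / n.slots))
        best.running += 1
        assignments.append((task.tid, best.name))
    return assignments, heap         # leftover tasks slide into the next window

nodes = [Node("fast", perf=2.0, slots=2), Node("slow", perf=1.0, slots=2)]
tasks = [Task(3, "t3"), Task(1, "t1"), Task(2, "t2")]
assignments, backlog = schedule_window(tasks, nodes, window=2)
```

The real MWSA additionally folds in data locality (via node waiting time), straggler detection that separates slow Map nodes from slow Reduce nodes, and a cap on backup-queue length; those concerns are omitted here for brevity.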
The algorithms are verified by experiments on real data, with test programs covering different aspects of the system; after the rebuilding and optimization, the performance of MapReduce is greatly improved.
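For reference, the system-parameter adjustments described above map onto standard configuration keys of the Hadoop 1.x (TaskTracker-era) releases this work targets. The fragment below is a sketch with illustrative values, not the tuned settings from the thesis:

```xml
<!-- mapred-site.xml fragment; values are illustrative examples only -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value> <!-- compress Map outputs to cut shuffle bandwidth -->
</property>
<property>
  <name>mapred.reduce.parallel.copies</name>
  <value>20</value> <!-- more copier threads in the shuffle stage -->
</property>
<property>
  <name>mapred.job.shuffle.input.buffer.percent</name>
  <value>0.70</value> <!-- memory fraction for buffering Map outputs -->
</property>
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value> <!-- per-node Map/Reduce slot ratio -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>
```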
Keywords/Search Tags:Cloud computing, Hadoop, MapReduce, Scheduling algorithm, Transmission protocol, System parameter