The Research Of Improving Performance Of Hadoop Cluster

Posted on: 2016-08-17
Degree: Master
Type: Thesis
Country: China
Candidate: S Xiong
Full Text: PDF
GTID: 2308330470966151
Subject: Computer technology
Abstract/Summary:
The rapid development of big data has made it one of the most important topics in the IT field. Many distributed systems and computing frameworks have emerged, and Hadoop and MapReduce in particular have attracted wide attention since their release. Many large companies now use Hadoop clusters as their main platform for storing and analyzing data, and deploying Hadoop has become a trend in the IT industry. However, every new technology has its limitations. As more and more companies build their base platforms on Hadoop, some of its limitations have been exposed, and cluster performance is the greatest concern among them. This thesis proposes solutions to two performance problems of the Hadoop platform, which should benefit the development and optimization of the Hadoop project as a whole.

The first issue is the impact of data distribution on cluster performance. Data locality is a key factor in the task performance of Hadoop. The saying that moving computation is better than moving data is a typical description of data locality: when a task runs on the node that already holds its data, no extra data transmission cost is incurred. In a homogeneous cluster, where every physical node has the same computing performance, native Hadoop distributes data to every node of the cluster according to its data backup (replication) mechanism, and this strategy is very effective. In a heterogeneous cluster, however, the nodes differ in computing performance, so data has to be transferred between nodes. Data locality suffers as a result, and the performance of the Hadoop cluster declines.
This thesis therefore examines the issue and proposes a concrete data distribution scheme that aims to balance the data load across the cluster by allocating data according to the computing performance of each physical machine. Experiments demonstrate the advantage of this data distribution mechanism.

The second issue is the impact of the shuffle phase on cluster performance. The shuffle phase is the most important part of MapReduce job execution: its performance directly affects the performance of the whole job, which shows up most visibly in the job's running time. To avoid network congestion, we propose a pre-shuffling algorithm that optimizes the current shuffle scheme and increases the throughput of Hadoop clusters by preprocessing the intermediate data between the map phase and the reduce phase. Its implementation consists of two parts: an active data push model, which reduces the average waiting time of reduce tasks, and a data transmission pipeline between map tasks and reduce tasks, which improves the efficiency of data transmission. Experiments show that the scheme addresses the problems described above and reduces the response time of jobs.
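The idea of allocating data according to the computing performance of each machine can be illustrated with a small sketch. It is not the thesis's actual algorithm, which the abstract does not specify. It assumes that each node has a positive integer performance score (a hypothetical measure, for example obtained from a benchmark) and splits a number of data blocks in proportion to those scores. The function name `allocate_blocks`, the node names, and the scores are all made up for illustration.

```python
def allocate_blocks(total_blocks, capacity):
    """Split total_blocks across nodes in proportion to their capacity scores.

    capacity maps a node name to a positive integer performance score
    (a hypothetical measure, not something defined in the thesis).
    Rounding uses the largest-remainder method so the shares always
    add up to total_blocks.
    """
    if total_blocks < 0:
        raise ValueError("total_blocks must be non-negative")
    if not capacity or any(score <= 0 for score in capacity.values()):
        raise ValueError("capacity must hold at least one positive score")

    total_capacity = sum(capacity.values())
    shares = {}
    remainders = {}
    for node, score in capacity.items():
        # Integer share plus the remainder lost to rounding down.
        shares[node], remainders[node] = divmod(total_blocks * score, total_capacity)

    # Hand the blocks lost to rounding to the nodes with the largest
    # remainders; ties are broken by node name to keep the result stable.
    leftover = total_blocks - sum(shares.values())
    by_remainder = sorted(capacity, key=lambda n: (-remainders[n], n))
    for node in by_remainder[:leftover]:
        shares[node] += 1
    return shares


# Example: a three-node heterogeneous cluster with assumed scores 3:2:1.
print(allocate_blocks(12, {"fast": 3, "medium": 2, "slow": 1}))
# {'fast': 6, 'medium': 4, 'slow': 2}
```

With such a split, a node that is three times as fast as another holds three times as many blocks, so each node can in principle work through its local data in a similar amount of time. A real placement policy would also have to respect replication and rack awareness, which this sketch ignores.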
Keywords/Search Tags: Hadoop, MapReduce, performance improvement, shuffle optimization, data distribution