
MapReduce Performance Research and Optimization Based on Block Aggregation

Posted on: 2015-03-31
Degree: Master
Type: Thesis
Country: China
Candidate: J Li
Full Text: PDF
GTID: 2268330425489012
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of computer and Internet technologies, cloud computing has emerged to meet the demands of the times. Because big data cannot be stored on a single computer and traditional serial processing incurs large time overheads, efficiently processing big data has become an urgent problem. As a computational model supporting distributed, parallel big-data processing, MapReduce has been widely adopted in many data-intensive application fields, such as machine learning, data mining, and scientific computing. Hadoop is an open-source implementation of the MapReduce computational model and has been used by many enterprises, including Yahoo, Amazon, and Facebook, to mine data sets such as search logs and user access logs. Although Hadoop's practical value is widely recognized, its performance still leaves room for improvement in several respects.

The core components of Hadoop are the Hadoop Distributed File System (HDFS) and the MapReduce computational framework, which are open-source counterparts of the Google File System (GFS) and Google's MapReduce. Through in-depth research on and practice with HDFS and MapReduce, this thesis examines the problem that a shared Hadoop cluster cannot guarantee that jobs with widely varying data sizes all run efficiently. When a shared cluster is configured to preserve the parallelism of jobs with small inputs, jobs with large inputs spawn a great many map tasks, which puts heavy pressure on the cluster master and consumes considerable resources during map-task initialization. Combining HDFS with the MapReduce framework, this thesis proposes an adaptive splitting algorithm based on block aggregation, which computes the split size from the actual input size, the number of input files, and the computing resources available to the job. Under this task-distribution mechanism, the data belonging to one split is grouped and stored on the same node. The algorithm preserves the parallelism of jobs of different data sizes while appropriately reducing the number of map tasks for jobs with large inputs; it thereby lowers task-initialization cost, relieves pressure on the master, and effectively improves cluster performance.

The current Hadoop implementation assumes that all nodes in a cluster have the same computing power, and data locality is not taken into account when launching speculative map tasks because most map tasks are assumed to be data-local. Unfortunately, neither the homogeneity assumption nor the data-locality assumption holds in heterogeneous Hadoop clusters, so native Hadoop fails to guarantee high performance in such environments. This thesis also discusses this problem and proposes an optimization scheme to improve overall performance in heterogeneous environments. In this scheme, the cluster master computes a split size according to each node's computing capacity and the actual input size, and stores the data within a split on the same node, so that every node carries a balanced data-processing load. Experimental results on a real application show that the proposed optimization scheme enables heterogeneous Hadoop clusters to run computational tasks more efficiently.
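The abstract describes two split-sizing ideas: aggregating blocks so that jobs with large inputs produce fewer map tasks, and weighting each node's share of the input by its computing capacity in heterogeneous clusters. The Java sketch below illustrates both calculations under stated assumptions; the class name, method names, and the exact formulas (one split per available map slot, capacity-proportional shares) are illustrative guesses, not the thesis's actual implementation.

// Hypothetical sketch of adaptive split sizing in the spirit of the
// block-aggregation approach described above. Names and formulas are
// illustrative assumptions, not the thesis's implementation.
public final class AdaptiveSplitSizer {

    /**
     * Computes a split size so that the job produces roughly one map task
     * per available map slot, rather than one per HDFS block.
     *
     * @param totalInputBytes total size of the job's input data
     * @param availableSlots  map slots the job can use across the cluster
     * @param blockSize       the HDFS block size (e.g. 128 MB)
     */
    public static long computeSplitSize(long totalInputBytes,
                                        int availableSlots,
                                        long blockSize) {
        // Ideal split: divide the input evenly over the usable slots.
        long ideal = (long) Math.ceil((double) totalInputBytes / availableSlots);
        // Never go below one block: small jobs keep block-level parallelism,
        // while large jobs aggregate several blocks into one split, shrinking
        // the number of map tasks the master must initialize and track.
        return Math.max(blockSize, ideal);
    }

    /**
     * Heterogeneous variant: weight each node's share of the input by its
     * relative computing capacity, so faster nodes receive larger splits
     * and all nodes finish their local data in roughly the same time.
     */
    public static long[] capacityWeightedSplits(long totalInputBytes,
                                                double[] nodeCapacities) {
        double totalCapacity = 0;
        for (double c : nodeCapacities) totalCapacity += c;
        long[] splits = new long[nodeCapacities.length];
        for (int i = 0; i < nodeCapacities.length; i++) {
            splits[i] = (long) (totalInputBytes * (nodeCapacities[i] / totalCapacity));
        }
        return splits;
    }

    public static void main(String[] args) {
        long block = 128L << 20; // 128 MB HDFS block
        // A 1 TiB job on 100 slots aggregates about 86 blocks per split...
        System.out.println(computeSplitSize(1L << 40, 100, block));
        // ...while a 256 MB job keeps the default block-sized splits.
        System.out.println(computeSplitSize(256L << 20, 100, block));
    }
}

For comparison, stock Hadoop's FileInputFormat derives the split size as max(minimumSplitSize, min(maximumSplitSize, blockSize)), independent of the job's total input size; the adaptive approach differs precisely in making the split size a function of the input size and the resources available to the job.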
Keywords/Search Tags: Cloud Computing, GFS, Hadoop, HDFS, MapReduce, Block Aggregation