
Research on a Hadoop-Based Data Placement Strategy

Posted on: 2018-01-26
Degree: Master
Type: Thesis
Country: China
Candidate: Y Z Wang
Full Text: PDF
GTID: 2348330518973589
Subject: Computer technology

Abstract/Summary:
In recent years, with the spread of social networks and the popularity of smart devices, data growth has been accelerating. Hadoop, as an open-source implementation of MapReduce, is becoming increasingly important in industry and academia, and the results of big data research directly affect economic and social development. In a Hadoop system, the data placement strategy affects the distribution and execution efficiency of MapReduce tasks, so research on data placement algorithms has important theoretical and practical value. This thesis focuses on the default data placement strategy and replica mechanism of the Hadoop Distributed File System (HDFS). To address the shortcomings of the default data placement strategy, a two-stage data placement algorithm is proposed that uses both the performance of the nodes and the association between data. Building on this algorithm, the K-means algorithm is used to adjust the placement of replicas and to reduce the pause time of the algorithm at run time.

The main contributions of this thesis are:

1. A node performance evaluation algorithm based on PageRank is proposed. A variety of benchmark tests are used to evaluate the performance of the nodes, the PageRank algorithm is used to compute the value of each evaluation score, and the resulting scores are normalized.

2. A two-stage data placement algorithm (TSDP) is proposed. In the first stage, data blocks are placed on each node according to node performance. The data blocks are then grouped according to the association between data, and the second stage of placement begins. Experiments show that task execution efficiency improves noticeably compared with the consistent hashing algorithm and the load balancing algorithm.

3. A new replica placement algorithm is proposed. The thesis first defines the heat of the data, which is used to adjust the number of replicas of the data. It then defines the association between nodes. Using node correlation in place of Euclidean distance, the K-means algorithm computes the cluster centers, which reduces the migration distance of replicas and the pause time of the TSDP algorithm. Experiments show that, compared with the unmodified TSDP algorithm, the pause time is noticeably reduced, while task execution efficiency decreases slightly.

4. A visual analysis platform for data placement is designed and developed.
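
As a rough illustration of the first TSDP stage described above (placing blocks according to node performance), the following is a minimal sketch and not the thesis's actual implementation. It assumes that each node already has a non-negative performance score, that the scores are normalized to sum to 1, and that blocks are split in proportion to those scores using largest-remainder rounding. The function names `normalize_scores` and `place_by_performance`, the node and block identifiers, and the rounding rule are all hypothetical choices made for this example; the PageRank-based scoring and the second, association-based stage are not modeled.

```python
from typing import Dict, List


def normalize_scores(raw: Dict[str, float]) -> Dict[str, float]:
    """Scale raw node scores so that they sum to 1 (assumed normalization)."""
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("scores must sum to a positive value")
    return {node: score / total for node, score in raw.items()}


def place_by_performance(
    blocks: List[str], raw_scores: Dict[str, float]
) -> Dict[str, List[str]]:
    """Assign blocks to nodes in proportion to their normalized scores.

    Hypothetical stage-one rule: each node receives the integer part of its
    proportional quota, and leftover blocks go to the nodes with the largest
    fractional remainders.
    """
    weights = normalize_scores(raw_scores)
    n = len(blocks)
    quotas = {node: w * n for node, w in weights.items()}
    counts = {node: int(q) for node, q in quotas.items()}

    # Distribute blocks lost to rounding down, largest remainder first.
    leftover = n - sum(counts.values())
    by_remainder = sorted(
        quotas, key=lambda node: quotas[node] - counts[node], reverse=True
    )
    for node in by_remainder[:leftover]:
        counts[node] += 1

    # Hand out consecutive blocks to each node according to its count.
    placement: Dict[str, List[str]] = {}
    start = 0
    for node in raw_scores:
        placement[node] = blocks[start:start + counts[node]]
        start += counts[node]
    return placement
```

Under these assumptions, a node with a higher score simply receives more blocks; a real placement policy would also have to account for capacity, rack topology, and replicas, which this sketch ignores.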
Keywords/Search Tags:Hadoop, Data Placement, Performance Evaluation, Clustering, Replicas Management