Research On Hadoop Based Data Placement Strategy

Posted on:2018-01-26

Degree:Master

Type:Thesis

Country:China

Candidate:Y Z Wang

Full Text:PDF

GTID:2348330518973589

Subject:Computer technology

Abstract/Summary:

In recent years, with the spread of social networks and the popularity of smart devices, data growth has accelerated. Hadoop, an open-source implementation of MapReduce, is becoming increasingly important in industry and academia, and the results of big data research directly affect economic and social development. In a Hadoop system, the data placement strategy affects how MapReduce tasks are distributed and how efficiently they execute, so research on data placement algorithms has important theoretical and practical value.

This thesis focuses on the default data placement strategy and replica mechanism of the Hadoop Distributed File System (HDFS). To address the shortcomings of the default data placement strategy, it proposes a two-stage data placement algorithm that uses the performance of the nodes and the associations between data. Building on this algorithm, the K-means algorithm is used to adjust the placement of replicas and reduce the pause time of the algorithm at run time. The main innovations of this thesis are:

1. A node performance evaluation algorithm based on PageRank is proposed. A variety of benchmark tests are used to evaluate node performance, the PageRank algorithm is used to calculate the value of each assessment score, and the evaluation scores are normalized.

2. A two-stage data placement algorithm (TSDP) is proposed. In the first stage, the data blocks are placed on the nodes according to node performance. The data blocks are then grouped according to the associations between data, and the second stage of placement begins. Experiments show that task execution efficiency improves noticeably compared with the consistent hashing algorithm and the load balancing algorithm.

3. A new replica placement algorithm is proposed. The thesis first defines the heat of the data, which is used to adjust the number of replicas of the data. It then defines the association between nodes. Using node association in place of Euclidean distance, the K-means algorithm is used to calculate the cluster centers, which reduces both the migration distance of replicas and the pause time of the TSDP algorithm. Experiments show that, compared with the unmodified TSDP algorithm, the pause time is clearly reduced, but task execution efficiency is slightly lower.

4. A data placement visualization and analysis platform is designed and developed.
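The idea of placing data blocks on nodes according to normalized performance scores can be illustrated with a minimal Python sketch. This is not the thesis's TSDP implementation: the abstract does not give the allocation rule, so the proportional split, the largest-remainder rounding, the function names `normalize_scores` and `allocate_blocks`, and the sample node names and scores are all assumptions made for illustration.

```python
from typing import Dict


def normalize_scores(raw_scores: Dict[str, float]) -> Dict[str, float]:
    """Scale raw performance scores so that they sum to 1 (assumed normalization)."""
    total = sum(raw_scores.values())
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return {node: score / total for node, score in raw_scores.items()}


def allocate_blocks(num_blocks: int, raw_scores: Dict[str, float]) -> Dict[str, int]:
    """Give each node a block count proportional to its normalized score.

    Hypothetical rule: the largest-remainder method is used so that the
    counts add up to num_blocks exactly.
    """
    weights = normalize_scores(raw_scores)
    exact = {node: num_blocks * w for node, w in weights.items()}
    counts = {node: int(share) for node, share in exact.items()}
    leftover = num_blocks - sum(counts.values())
    # Hand the remaining blocks to the nodes with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for node in by_remainder[:leftover]:
        counts[node] += 1
    return counts


if __name__ == "__main__":
    # Made-up scores: node-a is assumed to be twice as fast as the others.
    scores = {"node-a": 4.0, "node-b": 2.0, "node-c": 2.0}
    print(allocate_blocks(10, scores))
```

Under this assumed rule, a node with twice the score receives roughly twice as many blocks, which is one simple way to keep faster nodes from sitting idle while slower ones finish their map tasks.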

Keywords/Search Tags:

Hadoop, Data Placement, Performance Evaluation, Clustering, Replicas Management

Related items

1	Research On Dynamic Management Of Data Replicas In Heterogeneous Hadoop Cluster
2	Research On Dynamic Management Of Data Replicas In Heterogeneous Hadoop Clusters
3	Research On Parallelization Of Clustering Algorithm Based On Heterogeneous Hadoop Platform
4	Scientific Workflow Data Placement Method Based On Task Assignment And Dataset Replicas In Cloud Environment
5	Research On Hadoop Based Iterative Data Processing And Data Placement Strategy
6	Research On Distribute Storage Of Replicas Based On Hadoop
7	Research On Replicas Placement And Cache Optimization Of HDFS
8	Performance evaluation of big data placement structures in MapReduce-based data warehouse systems
9	Design And Implemention Of High Performance Text Clustering Algorithm Basic On Hadoop
10	Research And Implementation On Optimization Of Data Placement Mechanism In Hadoop