Research Of The Data Placement Strategy And Performance On Heterogeneous Distributed System

Posted on:2018-12-07

Degree:Master

Type:Thesis

Country:China

Candidate:H Sun

GTID:2348330563952607

Subject:Computer technology

Abstract/Summary:

With the arrival of the big data era, the timeliness of data-intensive applications is becoming increasingly important. The distributed computing framework Hadoop has become the main solution for processing big data. The performance of distributed computing is usually improved by adding new computing nodes to the cluster. As the cluster grows, however, it consumes a large amount of power because it lacks a reasonable management strategy. Research shows that most servers in data centers run at very low efficiency. How to make full use of the computing resources in the cluster, so as to improve the performance of the whole system and reduce power consumption, has therefore become a major research direction in academia and industry.

This thesis first analyzes how clusters are used in Hadoop. The analysis shows that users cannot make full use of the computing resources in the cluster because effective configuration is lacking. Adding nodes to the cluster is not a long-term way to improve the timeliness of distributed computing. Data storage in the distributed system is also managed in a coarse-grained way. Related research shows that a good data placement strategy can greatly improve the performance of distributed computing, yet the current data placement strategy does not consider the heterogeneity of the nodes in the cluster. As the cluster expands, the original data placement strategy cannot balance data allocation because the nodes differ in capacity, which reduces performance and increases power consumption.

To make the best use of computing resources and reduce power consumption, the first part of this thesis tunes the configuration parameters provided by Hadoop. A reasonable configuration that achieves the best performance is derived from experimental results. The experiments also show that giving different computing nodes different configurations achieves better performance: compared with the default Hadoop configuration, parameter tuning that accounts for the heterogeneity of the cluster performs better.

The second part shows experimentally that memory usage influences the performance of distributed computing, and proposes a task scheduling mechanism based on memory usage prediction. The mechanism predicts the future memory usage of each computing node by analyzing its past usage. When a computing node is under memory pressure, the mechanism relieves that pressure by allocating fewer tasks to the node. The mechanism can be made more flexible by setting a threshold on memory usage. By making full use of the computing resources, it improves the performance of the system.

The third part proposes a data placement strategy based on information about the storage nodes. By managing this node information, the strategy applies different management policies to different machines. It takes computing performance, storage capacity, and data correlation fully into account. By using different management policies, it can achieve load balance, make full use of the performance of the storage nodes, and reduce power consumption.

To measure the effect of the proposed schemes on system performance, a Hadoop cluster environment was built for the experiments. Compared with the default configuration, parameter tuning reduces the running time of the same task and improves system performance. Compared with the default task scheduling mechanism, the task scheduling mechanism based on memory usage prediction reduces the average execution time by 6.625 s and improves performance by 4.25% on average. Finally, the experimental results on data correlation show that the proposed data placement strategy can improve system performance and reduce power consumption.
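
The memory-aware scheduling idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the thesis's actual predictor or allocation rule, so everything here is an assumption made for illustration: the exponentially weighted moving average used as the predictor, the linear scale-down of task slots above the threshold, and the hypothetical names `predict_usage`, `slots_to_offer`, `alpha`, and `threshold`.

```python
def predict_usage(samples, alpha=0.5):
    """Predict the next memory usage (a fraction in [0, 1]) from past samples.

    Assumed predictor: an exponentially weighted moving average, where
    `alpha` is the weight given to each newer sample.
    """
    if not samples:
        return 0.0
    estimate = samples[0]
    for sample in samples[1:]:
        estimate = alpha * sample + (1 - alpha) * estimate
    return estimate


def slots_to_offer(samples, max_slots, threshold=0.75, alpha=0.5):
    """Return how many task slots a node should accept in the next round.

    Assumed rule: offer all slots while predicted usage is at or below
    `threshold`; above it, scale the slot count down linearly so that it
    reaches zero when predicted usage hits 100%.
    """
    predicted = predict_usage(samples, alpha)
    if predicted <= threshold:
        return max_slots
    if threshold >= 1.0:
        # No headroom is defined above a threshold of 100%.
        return 0
    headroom = max(0.0, 1.0 - predicted)
    return int(max_slots * headroom / (1.0 - threshold))


if __name__ == "__main__":
    # A lightly loaded node keeps all of its slots.
    print(slots_to_offer([0.5, 0.5, 0.5], max_slots=8))
    # A node predicted to be under memory pressure is given fewer tasks.
    print(slots_to_offer([0.875, 0.875], max_slots=8))
```

Raising or lowering `threshold` changes how early a node starts shedding tasks, which corresponds to the flexibility the abstract attributes to the memory usage threshold.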

Keywords/Search Tags:

big data, high performance, prediction, data placement, task scheduling

Related items

1	Research On Data Placement And Task Scheduling Algorithm
2	Performance Optimization For Big Data Progressing Systems In The Cloud
3	The Research Of Energy Saving Data Placement And Task Scheduling Algorithms In Distributed Systems
4	Resource Allocation And Scheduling In Big Data Clusters
5	Research On Spark Task Scheduling Technology Based On Execution Time Prediction
6	Research On Virtual Machines Performance Interference Prediction And Placement Techniques In Cloud Data Center
7	Performance Optimization For Parallel Systems With Shared Dwm Via Retiming, Loop Scheduling, and Data Placement
8	The Research On High Performance Task Scheduling Technology Based On Mapreduce In Cloud Computing
9	Research On The Bulk Cloud Data Placement And Transfer Scheduling Among Datacenters
10	Big Data System Optimization Under High-speed Network Environment