The Research And Implementation Of Replication Management In HDFS

Posted on: 2016-09-13
Degree: Master
Type: Thesis
Country: China
Candidate: Q Zhang
Full Text: PDF
GTID: 2348330473963407
Subject: Control Science and Engineering
Abstract/Summary:
The rapid development of information technology has caused Internet data to grow quickly, and cloud computing has emerged in this era of big data. Cloud storage is the foundation of cloud computing: it pools a large number of distributed, low-cost storage resources to provide users with large-capacity, high-performance storage services. Replication management is one of the key technologies of a cloud storage system. It improves the reliability, fault tolerance, and scalability of the system, raises the level of concurrent user access it can support, and improves service performance. HDFS is an open-source distributed storage system that has been widely adopted. Although HDFS has a great advantage in storage capacity, its replication strategy still has deficiencies. This paper describes the replication management process in detail and analyzes the factors that affect the performance and efficiency of the storage system when a large number of users access it concurrently. Taking user access behavior, node service cost, and load balancing into account, this paper presents a new replication management strategy. The main work is as follows:

(1) Replica placement in HDFS does not consider the heterogeneity between nodes. If a low-performance node holds too much data, accesses to it cause a heavy load, which unbalances the load across the system and degrades service performance. This paper presents a replica placement strategy based on node selection cost. The strategy considers node performance and the impact of network bandwidth, and then selects the target nodes one by one.

(2) The default replica strategy in HDFS does not change the number of replicas, which causes a bottleneck at nodes with high access frequency. This paper presents a dynamic replica creation strategy based on multi-stage access frequency. The strategy increases the number of replicas of a file dynamically according to the file's popularity, which effectively increases the response speed of the system and reduces job completion time. Moreover, replicas that have not been used for a long time should be deleted to reduce management cost and wasted resources.

(3) To meet users' requirements for access response time, this paper presents a replica selection strategy based on the service capability of nodes. The strategy considers node load and data transfer time to select the most capable node, which reduces job waiting time and improves user access efficiency.

The proposed replication management strategy was compared in OptorSim simulations with the default HDFS strategy and the strategies built into OptorSim. The results show that the average job response time is 10% lower than with the other strategies, the total number of dynamic replicas is 28% lower, and the effective network usage is 32% lower.
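The three strategies in the abstract above can be illustrated with a minimal Python sketch. The abstract gives no formulas, so everything concrete here is an assumption made for illustration: the `Node` fields, the cost and response-time expressions, the weights, the decay factor, and the thresholds. It is not the thesis's actual model and does not use any HDFS or OptorSim API. The only value taken from HDFS itself is the default replication factor of 3.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    capacity: float    # relative processing capability, higher is better (hypothetical unit)
    bandwidth: float   # available bandwidth in MB/s (hypothetical unit)
    load: float = 0.0  # fraction of the node's capacity in use, 0.0 to 1.0


def selection_cost(node, w_load=0.5, w_bw=0.5):
    # Assumed cost: busy, weak, or slow-linked nodes cost more.
    return w_load * node.load / node.capacity + w_bw / node.bandwidth


def place_replicas(nodes, count, load_per_replica=0.1):
    # (1) Placement: pick target nodes one by one by lowest cost.
    # Each pick adds load to the chosen node (the Node object is mutated).
    chosen = []
    candidates = list(nodes)
    for _ in range(min(count, len(candidates))):
        best = min(candidates, key=selection_cost)
        best.load += load_per_replica
        chosen.append(best)
        candidates.remove(best)
    return chosen


def popularity(stage_counts, decay=0.5):
    # stage_counts[0] is the most recent stage; older stages count less.
    return sum(c * decay ** i for i, c in enumerate(stage_counts))


def target_replica_count(stage_counts, base=3, per_replica=100.0,
                         max_replicas=10, idle_stages=3):
    # (2) Dynamic creation: more replicas for popular files, and a return
    # to the base count for files unused over the last idle_stages stages.
    recent = stage_counts[:idle_stages]
    if len(recent) == idle_stages and sum(recent) == 0:
        return base
    wanted = base + int(popularity(stage_counts) // per_replica)
    return min(wanted, max_replicas)


def select_replica(holders, file_size_mb, wait_per_load=10.0):
    # (3) Selection: estimated response time is an assumed queueing delay
    # proportional to load plus the transfer time of the file.
    def response_time(node):
        return wait_per_load * node.load + file_size_mb / node.bandwidth
    return min(holders, key=response_time)
```

For example, `place_replicas(nodes, 2)` returns the two cheapest nodes in the order they were picked, and `select_replica(holders, 128)` returns the replica holder with the lowest estimated response time for a 128 MB file. In a real evaluation the weights and thresholds would have to be tuned against measurements rather than fixed as they are here.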
Keywords/Search Tags: cloud storage, replication management, HDFS, file popularity