Research on HDFS Replica Placement, Management Policy, and Retrieval Algorithm in a Heterogeneous Storage Environment

Posted on: 2021-05-06
Degree: Master
Type: Thesis
Country: China
Candidate: Y Qin
Full Text: PDF
GTID: 2428330623968158
Subject: Software engineering
Abstract/Summary:
With the rapid development of Internet technology, the total amount of data on the Internet has continued to rise, and data has become an important national strategic resource. Because traditional storage systems struggle to break through their capacity limits, distributed storage systems are increasingly favored by industry. Among them, HDFS is widely used to store massive data in big data application systems. At the same time, hardware keeps advancing, and storage media with faster read and write speeds, such as solid state drives, continue to appear and be deployed. As HDFS clusters expand and evolve, they have gradually shifted from homogeneous to heterogeneous, and a single cluster usually contains several kinds of storage media. In heterogeneous clusters, the key problems to be solved are how to read and write file replicas efficiently and how to use the different storage media sensibly.

HDFS, however, was originally designed for a homogeneous environment. Its default replica placement policy, management policy, and retrieval algorithm all assume homogeneity and have many deficiencies in a heterogeneous environment. For replica placement and retrieval, HDFS considers only network distance when selecting the nodes that store replicas or serve reads. It ignores both the heterogeneity of the nodes and the differences in their real-time performance, which can easily cause load imbalance among nodes. For management, HDFS uses a static replica management policy: once a file's replicas are placed, their location and number never change. This ignores the fact that a file's access characteristics change over time, so the allocation may become unreasonable and waste storage space and system resources.

To solve these problems, this thesis first uses "temperature" to quantify the access characteristics of files and assigns different heterogeneous storage policies to files with different "temperature". By regularly updating each file's temperature, the system can sense changes in file access characteristics and adjust the position and number of replicas accordingly, achieving dynamic management of file replicas. Second, this thesis uses a "composite load" to quantify the real-time performance of the nodes and proposes a calculation method based on the analytic hierarchy process. By regularly updating the "composite load" value of each node, the system obtains the current status of every node, builds a multi-level global service queue, and uses a load balancing algorithm based on the node service queue to distribute read and write requests, achieving load balancing at the data node level. In this way, this thesis optimizes the full chain of HDFS replica placement, management, and retrieval.

Experimental verification shows that the optimized system can adjust the position and number of replicas according to changes in file access characteristics. The average write speed of HDFS increased by 5.35%, and the average read speed increased by 11.84%. The number of times HDFS hits the SSD when reading data also increased significantly. In short, the optimization improves the overall I/O efficiency of the HDFS system, shortens read and write latency, reduces the storage cost of the system, and keeps the load on the cluster's nodes balanced.
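The abstract does not give the composite-load formula, so the following is only a minimal sketch of how an analytic-hierarchy-process weighting could be turned into a per-node score. The three criteria (disk I/O, CPU, network), the pairwise judgement matrix, the node names, and the function names are all assumptions made for illustration and are not taken from the thesis. The weights are approximated with the row geometric-mean method, a common shortcut for the principal-eigenvector calculation in AHP.

```python
import math

def ahp_weights(pairwise):
    """Approximate AHP priority weights using the row geometric-mean method.

    `pairwise` is a reciprocal matrix: pairwise[i][j] states how much more
    important criterion i is than criterion j.
    """
    n = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]

def composite_load(metrics, weights):
    """Weighted sum of utilisation metrics, each already scaled to [0, 1]."""
    return sum(w * m for w, m in zip(weights, metrics))

def pick_least_loaded(nodes, weights):
    """Return the name of the node with the smallest composite load."""
    return min(nodes, key=lambda name: composite_load(nodes[name], weights))

# Hypothetical judgements for (disk_io, cpu, network): disk I/O is assumed
# to matter most, then network, then CPU.
PAIRWISE = [
    [1.0, 3.0, 2.0],
    [1.0 / 3.0, 1.0, 0.5],
    [0.5, 2.0, 1.0],
]

# Hypothetical utilisation samples per data node: (disk_io, cpu, network).
NODES = {
    "dn1": (0.9, 0.2, 0.3),
    "dn2": (0.2, 0.8, 0.4),
}

weights = ahp_weights(PAIRWISE)
print(weights)
print(pick_least_loaded(NODES, weights))
```

In a scheme like the one the abstract describes, such a score would be recomputed periodically and used to order nodes in the service queue. A full AHP treatment would also check the consistency ratio of the judgement matrix, which this sketch omits.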
Keywords/Search Tags: HDFS, heterogeneity, replica management, load balance