
Research And Implementation Of HDFS Replica Management Tool Based On File Access Heat

Posted on: 2018-07-25
Degree: Master
Type: Thesis
Country: China
Candidate: S C Zhao
Full Text: PDF
GTID: 2428330542488040
Subject: Software engineering

Abstract/Summary:
With the development of Web technology, a large amount of data is being produced. To store and analyze such massive data, related concepts such as cloud storage, cloud computing, big data analysis, and data mining have been put forward. At present, Apache Hadoop has become the mainstream framework for distributed data processing and can effectively improve the efficiency of big data processing. Within the Hadoop framework, data replica management has always been a hot and difficult research topic. Although much research has been done on HDFS data replica management, two key problems remain to be studied and solved: how to set an appropriate number of replicas that adapts to changes in file access, and how to place replicas so as to optimize the cluster load. Based on a survey of the research status and deficiencies of HDFS data replica management, this thesis studies the optimal replica number and the replica placement problem, and develops a replica management tool on top of the proposed methods.

For the optimal number of replicas, this thesis proposes a method that computes the replica count from file access heat. The concept of file access heat is defined to measure the frequency of file access. On this basis, future access heat is predicted with time-series analysis, reflecting how file access changes dynamically over time. A relationship between file access heat and replica count is then established, yielding a replica-number formula and a corresponding algorithm. This avoids the difficulty of a fixed replica count failing to adapt to changing access heat, effectively improves cluster utilization, and helps achieve load balancing.

For the replica placement problem, this thesis proposes a measurement of Join access relevance based on the Join access relationships among files, and analyzes the influence of Join access relevance on replica placement. A mathematical model of the replica placement problem that accounts for Join access relevance is given, together with a heuristic algorithm to solve it. By considering the Join associations among files, file blocks with Join access relevance to one another are placed on nodes with lower communication cost, which reduces data transmission cost and helps guarantee job execution time.

On this basis, the thesis applies software engineering theory and methods to the design and implementation of an HDFS replica management tool. A use-case analysis of the tool is given, followed by its architecture design, functional design, and database design. The replica-number forecasting module, the dynamic replica adjustment module, the Hadoop cluster state module, the access log acquisition module, and the database are designed and implemented, and the key functional modules are tested.

Finally, a series of experiments evaluates the proposed method and the related algorithms. The experimental results show that the proposed method has advantages in cluster concurrency, task execution efficiency, and cluster load balance. The replica management tool developed in this thesis has been applied in a health big data management and analysis service system (a National Science and Technology Support Project); the application results show that the tool not only meets the requirements but also improves, to a certain extent, the efficiency of data storage and analysis on the platform.
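The abstract does not give the thesis's actual time-series model or replica-number formula, so the following is only an illustrative sketch of the heat-based approach: access heat is predicted here with simple exponential smoothing (a stand-in for the thesis's time-series method), and the `base_heat` scale, the proportional mapping, and the replica bounds are all assumed parameters.

```python
import math

def predict_heat(history, alpha=0.5):
    """Predict the next-period access heat from past access counts.
    Single exponential smoothing is used here as an assumed stand-in
    for the thesis's time-series analysis; `alpha` weights recent
    observations more heavily."""
    heat = history[0]
    for h in history[1:]:
        heat = alpha * h + (1 - alpha) * heat
    return heat

def replica_count(predicted_heat, base_heat=100.0,
                  min_replicas=1, max_replicas=10):
    """Map predicted access heat to a replica count, clamped to
    HDFS-style bounds. The proportional mapping and the constants
    are illustrative, not the thesis's formula."""
    n = math.ceil(predicted_heat / base_heat) + min_replicas
    return max(min_replicas, min(max_replicas, n))
```

A hotter file thus gets more replicas (up to the cap), and a file whose predicted heat drops is trimmed back toward the minimum, which is the adaptive behavior the method aims for.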
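The placement heuristic itself is not spelled out in the abstract; the sketch below assumes (as the text suggests) that co-locating two Join-related blocks gives the lowest communication cost. File pairs are processed in decreasing order of an assumed `relevance` score and co-located when node `capacity` allows; everything here (the greedy order, the capacity model, the tie-breaking) is a hypothetical simplification of the thesis's heuristic.

```python
def greedy_placement(files, relevance, nodes, capacity):
    """Greedy Join-relevance-aware placement sketch.
    `relevance[(f, g)]` is an assumed symmetric Join-access relevance
    score between files f and g; each node holds at most `capacity`
    blocks. Assumes total capacity is sufficient for all files."""
    placement = {}
    load = {n: 0 for n in nodes}

    def least_loaded():
        # node with the most free capacity (ties broken by node order)
        return min((n for n in nodes if load[n] < capacity),
                   key=lambda n: load[n])

    # handle the most strongly Join-related file pairs first
    for f, g in sorted(relevance, key=relevance.get, reverse=True):
        for x, other in ((f, g), (g, f)):
            if x in placement:
                continue
            if other in placement and load[placement[other]] < capacity:
                placement[x] = placement[other]  # co-locate: lowest cost
            else:
                placement[x] = least_loaded()
            load[placement[x]] += 1

    # files with no measured Join relevance just balance the load
    for f in files:
        if f not in placement:
            placement[f] = least_loaded()
            load[placement[f]] += 1
    return placement
```

For example, with files A..D, relevance `{("A","B"): 0.9, ("C","D"): 0.5}`, two nodes, and capacity 2, the heuristic co-locates A with B and C with D, so Joins over related blocks avoid cross-node transfers.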
Keywords/Search Tags: Hadoop, HDFS, Replica, File access heat, Load balancing