
Research on Replica Strategies for the Distributed Parallel File System HDFS

Posted on: 2014-07-21    Degree: Master    Type: Thesis
Country: China    Candidate: C F Huang    Full Text: PDF
GTID: 2268330425451887    Subject: Computer software and theory
Abstract/Summary:
In recent years, with the continued development of science and technology, the global scale of data has been growing rapidly. Web 2.0 emphasizes user interaction and has turned Internet users from readers into creators. Faced with such massive volumes of information, traditional storage systems cannot keep up with this high-speed growth; they hit capacity and performance bottlenecks, such as limits on the number of hard drives and servers.

HDFS (Hadoop Distributed File System) differs from traditional distributed parallel file systems. Running on inexpensive commodity machines, it is a new kind of distributed file system with high throughput, high fault tolerance, and high reliability, providing distributed data storage and management along with high-performance data access and interaction.

In HDFS, the replica is an important component: replica technology coordinates network node resources to complete large workloads efficiently. It accomplishes this through replica placement, replica selection, replica adjustment, and related mechanisms, improving the effective transmission of data between resource nodes.

This paper first surveys current replica management strategies, summarizing prior research achievements in the field and their limitations. It then analyzes the HDFS system architecture, I/O mechanism, and related aspects to establish a dynamic replica management model for HDFS, discussing replica placement and de-duplication. On this basis, the replica placement algorithm is improved: a modified replica placement strategy based on distance and load is proposed, together with a balance factor that adjusts the relative weight of distance and load to meet the requirements of different users.
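The distance-and-load placement idea with a balance factor can be sketched as follows. This is an illustrative assumption, not the thesis's exact formulas: the utility function, the normalization, and the names `Node`, `utility`, and `select_best_node` are all hypothetical.

```python
# Hypothetical sketch of distance/load-based replica placement with a
# balance factor alpha. The utility formula below is an illustrative
# assumption; the thesis's exact evaluation function may differ.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    distance: float   # network distance to the writer (e.g. hops); lower is better
    bandwidth: float  # available transmission bandwidth (MB/s); higher is better
    load: float       # current node load in [0, 1]; lower is better

def utility(node: Node, replica_size: float, alpha: float) -> float:
    """Combine transfer cost and load into one score.

    alpha is the balance factor: alpha -> 1 favors short transfer
    distance/time, alpha -> 0 favors lightly loaded nodes.
    """
    transfer_cost = node.distance + replica_size / node.bandwidth
    distance_score = 1.0 / (1.0 + transfer_cost)  # higher is better
    load_score = 1.0 - node.load                  # higher is better
    return alpha * distance_score + (1.0 - alpha) * load_score

def select_best_node(nodes, replica_size, alpha=0.5):
    """Pick the node with the highest utility value."""
    return max(nodes, key=lambda n: utility(n, replica_size, alpha))

nodes = [
    Node("dn1", distance=2, bandwidth=100, load=0.9),
    Node("dn2", distance=4, bandwidth=100, load=0.1),
]
# With a balanced alpha, the lightly loaded dn2 wins despite its longer distance.
best = select_best_node(nodes, replica_size=64.0, alpha=0.5)
```

Raising `alpha` toward 1 makes the closer but busier node win instead, which is how a single tunable parameter can serve users with different priorities.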
At the same time, according to the demands of the replica adjustment stage, the replica de-duplication strategy is improved: a replica evaluation function and a de-duplication strategy based on replica valuation are proposed. Finally, the validity of the proposed replica strategies is verified through simulation experiments and comparison against the HDFS default replica strategy.

The main contributions of this paper are:

1) An analysis of the differences between HDFS and traditional distributed systems, focusing on a comparative analysis with GFS: the design ideas and principles of both systems, and the similarities and differences of their replica management strategies. The analysis shows that HDFS is a simplified design of GFS with more flexible operation.

2) A replica placement strategy based on distance and load information. It replaces the random placement algorithm in HDFS, considering three influencing factors: replica size, transmission bandwidth, and node load. By calculating a utility value for each node, the best node is selected. Simulation experiments validate the superiority of the algorithm for load balancing.

3) A de-duplication strategy based on replica valuation. When a replica write operation is requested, the Namenode randomly obtains a set of Datanodes, selects a node, and writes the data. If the selected node already holds many replicas and is heavily loaded, performance suffers, yet the HDFS default replica adjustment strategy ignores node status. The improved strategy calculates a value for each replica and phases out the minimum-value replica to free up storage space. Simulation results show that this strategy outperforms the HDFS default strategy in large-file write tests.
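The valuation-based de-duplication idea can be sketched as: score every replica, then discard the lowest-scoring one when space is needed. The value function below (access frequency weighted against node load) and the names `Replica`, `replica_value`, and `evict_lowest_value` are illustrative assumptions; the thesis's actual evaluation function may weigh different factors.

```python
# Illustrative sketch of valuation-based replica de-duplication. The value
# function here is an assumption for illustration: frequently accessed
# replicas hosted on lightly loaded Datanodes are considered most valuable.
from dataclasses import dataclass

@dataclass
class Replica:
    block_id: str
    access_count: int  # recent read requests served by this replica
    node_load: float   # load of the hosting Datanode, in [0, 1]

def replica_value(r: Replica) -> float:
    """Higher access frequency and a lighter host raise a replica's value."""
    return r.access_count * (1.0 - r.node_load)

def evict_lowest_value(replicas, min_replicas=2):
    """Phase out the minimum-value replica, keeping at least min_replicas
    copies so fault tolerance is preserved. Returns (kept, victim)."""
    if len(replicas) <= min_replicas:
        return replicas, None
    victim = min(replicas, key=replica_value)
    return [r for r in replicas if r is not victim], victim

replicas = [
    Replica("blk_1", access_count=10, node_load=0.2),  # value 8.0
    Replica("blk_1", access_count=3,  node_load=0.5),  # value 1.5
    Replica("blk_1", access_count=7,  node_load=0.9),  # value 0.7 -> evicted
]
kept, victim = evict_lowest_value(replicas)
```

Unlike the default strategy, which ignores node status when trimming excess replicas, this sketch ties eviction to both usage and host load.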
Keywords/Search Tags: HDFS, Replica Management, Replica Value, Load Balance