
A Model Of Replica Management Based On Distributed Parallel File System HDFS

Posted on: 2011-04-26    Degree: Master    Type: Thesis
Country: China    Candidate: J W Hei    Full Text: PDF
GTID: 2178360305454903    Subject: Computer system architecture
Abstract/Summary:
In today's highly information-oriented society, traditional data storage technologies such as databases and traditional data processing technologies can no longer keep up with the needs of technological development, and new technologies for mass data storage and processing continue to emerge. For example, Google uses the parallel file system GFS and the MapReduce programming model for massive data processing, while Yahoo, Facebook and other companies support the open-source Hadoop software system.

In parallel file systems, replicas are a key component. An important role of a distributed parallel file system is to coordinate nodes of varying capability, whether of lower or higher performance and often of poor individual reliability, into a highly reliable system that can handle high-volume tasks. Creating replicas is the means of achieving this goal, so how to manage these replica resources, including the strategies for placing replicas and for changing them, is one of the important tasks of a distributed parallel file system.

This paper analyzes the distributed parallel file system HDFS and presents a replica management model for HDFS. The paper has two main parts: one analyzes the design concepts of HDFS and compares its similarities and differences with other distributed parallel file systems in order to understand the ideas and objectives behind HDFS; the other proposes a dynamic replica management model for HDFS, including a dynamic replica creation strategy and a deletion strategy based on historical access records, together with a replica placement strategy.

Hadoop consists of two parts. The lower layer is the Hadoop Distributed File System (HDFS), composed of a master node, the NameNode, which manages metadata, and DataNodes, which store the data. The upper layer is MapReduce, a computing model built on HDFS and composed of a JobTracker and TaskTrackers. The JobTracker partitions and schedules tasks; each TaskTracker receives Map or Reduce tasks distributed by the JobTracker, executes them, and reports the results back to the JobTracker. Hadoop deploys storage and computation on the same cluster nodes, so a node running a TaskTracker is also an HDFS storage node, a DataNode.
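To make this division of roles concrete, here is a minimal sketch (not taken from the thesis) of a client writing a file through Hadoop's Java FileSystem API: the client only talks to the file system abstraction, while the NameNode decides which DataNodes hold the blocks and their replicas. The path and replication factor used here are purely illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Load the cluster configuration (core-site.xml / hdfs-site.xml on the classpath).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Illustrative path; the NameNode, not the client, chooses the DataNodes
        // that will store the blocks and their replicas.
        Path file = new Path("/example/data.txt");
        FSDataOutputStream out = fs.create(file);
        out.writeBytes("hello hdfs\n");
        out.close();

        // Request three replicas for the file; the NameNode schedules the copies.
        fs.setReplication(file, (short) 3);

        fs.close();
    }
}
```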
HDFS has its own detailed design goals and assumptions, which are the main points where it differs from other distributed parallel file systems:

1. Hardware failure: hardware failure is the norm rather than the exception.
2. Streaming data access: applications running on HDFS differ from ordinary applications; they mainly perform streaming access to their data sets, and HDFS is designed for batch processing.
3. Large data sets: applications running on HDFS use large data sets; HDFS is designed to support them, and a typical file may be gigabytes or terabytes in size.
4. A simple consistency model: HDFS applications need a write-once, read-many access model for files.
5. Moving computation is cheaper than moving data: if a computation is executed next to the data it operates on, the request is handled more efficiently.
6. Portability across heterogeneous hardware and software platforms.

Although replicas in distributed parallel file systems have already been studied extensively at home and abroad, no common standard has emerged, because replica management still faces many open problems: how many replicas to create, the granularity of replicas, where to place them, and when to create or delete a replica. Against this background, and based on an understanding of the design goals of HDFS, this paper presents a replica management model for HDFS. In this model, replica management decisions must be made dynamically, taking all the nodes of the cluster into account: a node's CPU load, disk traffic and capacity, network bandwidth and other information are combined to decide whether a replica needs to be created or deleted.

For a file written to HDFS, this paper adopts the following placement strategy for the master replica and the default replicas: the master replica and one default replica are stored in the local rack (the set of machines under the same router as the uploading client), and another default replica is placed on an arbitrary rack other than the local one. The machine within each rack is chosen using two indicators: (1) the number of data blocks already stored on the machine, and (2) its CPU processing performance. For machine i, a value P is computed from its stored block count and its CPU processing performance, weighted by a constant factor. The P values of all nodes in the local rack are computed, and the two machines with the smallest P are chosen to hold the master replica and one default replica; the P values of all nodes in the remote racks are computed, and the machine with the smallest P is chosen to hold the other default replica. During selection, machines that already hold a copy of the data block are skipped, and free space is checked at the same time: machines without enough space to store the replica are also skipped. A sketch of this selection step follows.
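As a rough illustration of the selection step, the sketch below (not from the thesis) scores each candidate DataNode in a rack and picks the ones with the smallest P. The abstract does not give the exact expression for P, so the sketch assumes P = storedBlocks / (k * cpuPerformance), so that fewer stored blocks and higher CPU performance both lower the score; the class and field names are likewise hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical view of a candidate DataNode used for replica placement. */
class NodeInfo {
    String host;
    long storedBlocks;      // number of data blocks already stored on this node
    double cpuPerformance;  // CPU processing performance of this node
    long freeSpace;         // free disk space in bytes
    boolean holdsBlock;     // whether this node already holds a copy of the block

    NodeInfo(String host, long storedBlocks, double cpuPerformance,
             long freeSpace, boolean holdsBlock) {
        this.host = host;
        this.storedBlocks = storedBlocks;
        this.cpuPerformance = cpuPerformance;
        this.freeSpace = freeSpace;
        this.holdsBlock = holdsBlock;
    }
}

public class ReplicaPlacement {
    /** Assumed form of the score: fewer blocks and a faster CPU give a smaller P. */
    static double pValue(NodeInfo n, double k) {
        return n.storedBlocks / (k * n.cpuPerformance);
    }

    /**
     * Pick the 'count' nodes in one rack with the smallest P value,
     * skipping nodes that already hold the block or lack free space.
     */
    static List<NodeInfo> selectNodes(List<NodeInfo> rack, int count,
                                      long blockSize, double k) {
        List<NodeInfo> candidates = new ArrayList<>();
        for (NodeInfo n : rack) {
            if (n.holdsBlock || n.freeSpace < blockSize) {
                continue; // skip nodes that already have the block or are full
            }
            candidates.add(n);
        }
        candidates.sort(Comparator.comparingDouble((NodeInfo n) -> pValue(n, k)));
        return candidates.subList(0, Math.min(count, candidates.size()));
    }
}
```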
The placement strategy for additional replicas is as follows: at regular intervals the historical access records are checked to see whether the number of requests for any file exceeds a predetermined threshold. If such a file exists, the rack with the largest total number of accesses to it is determined; this rack is called the best rack. The system then selects the node with the lightest load in this rack to create a replica of the file, clears the historical access records of that file, and starts the statistics again.

This paper presents a replica creation policy based on historical access records: the previous N access records of a piece of data are used to judge whether it is hot data, and if so a replica of it is created. Combining the hot-data judgment with the previous N historical records improves accuracy and makes replica creation predictable.

Let NA be the number of accesses to a file and NA(i) the number of accesses in the i-th interval, and let NF be the access-count threshold. For each time interval there is a history record h; H is the set of historical records accumulated to date, and R is their total number. Suppose the previous N historical records are analyzed; then replicas of hot data files are created in the following steps (a sketch follows this list):

(1) Compute the dynamic access characteristic value P of each file from the previous N access records;
(2) Sort the history records h in descending order of P and set a threshold MP on P, where MP is the P value of the smallest-P file that is still eligible for replication as a hot-spot file; delete the records whose value is less than MP and all records after them;
(3) As long as H is not empty: (a) pop a record h from H; (b) create a replica of the file that produced record h; (c) update the record with P(h) = P(h) - MP, and if P(h) > MP, re-insert h into H and sort in descending order again.

This paper also presents a simplified replica deletion policy based on historical access records: if a file has replicas other than the master replica and the default replicas, and over N cycles the number of accesses to the file is less than a threshold NL, then one of those other replicas is deleted.
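The following sketch (again not code from the thesis) illustrates steps (1)-(3) and the deletion check under stated assumptions: the dynamic access characteristic P of a file is taken here simply as the sum of its accesses over the previous N intervals, and createReplica / shouldDeleteExtraReplica stand in for the actual replica operations; all names are hypothetical.

```java
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical per-file history record over the previous N access intervals. */
class HistoryRecord {
    String file;
    int[] accessesPerInterval; // NA(i) for the previous N intervals
    double p;                  // dynamic access characteristic value

    HistoryRecord(String file, int[] accessesPerInterval) {
        this.file = file;
        this.accessesPerInterval = accessesPerInterval;
        // Assumption: P is the total access count over the previous N intervals.
        for (int a : accessesPerInterval) {
            this.p += a;
        }
    }
}

public class HotSpotReplication {

    /** Stand-in for asking the NameNode to create one more replica of the file. */
    static void createReplica(String file) {
        System.out.println("create replica for " + file);
    }

    /**
     * Steps (1)-(3): process the history records in descending order of P,
     * creating replicas for hot files until the queue is exhausted.
     */
    static void createHotSpotReplicas(List<HistoryRecord> history, double mp) {
        // Max-heap on P, i.e. the descending order of step (2).
        PriorityQueue<HistoryRecord> h = new PriorityQueue<>(
                Comparator.comparingDouble((HistoryRecord r) -> r.p).reversed());
        for (HistoryRecord r : history) {
            if (r.p >= mp) {   // drop records whose value is below the threshold MP
                h.add(r);
            }
        }
        while (!h.isEmpty()) {
            HistoryRecord r = h.poll();        // pop the record with the largest P
            createReplica(r.file);             // create a replica of its file
            r.p -= mp;                         // update P(h) = P(h) - MP
            if (r.p > mp) {
                h.add(r);                      // re-insert while it is still hot
            }
        }
    }

    /** Deletion policy: drop one extra replica if accesses over N cycles fall below NL. */
    static boolean shouldDeleteExtraReplica(int replicaCount, int defaultReplicas,
                                            int accessesInLastNCycles, int nl) {
        return replicaCount > defaultReplicas && accessesInLastNCycles < nl;
    }
}
```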
Keywords/Search Tags: Distributed Parallel File System, HDFS, replica management