
Research And Improvement Of Data Check Strategy In Distributed File System

Posted on: 2014-01-05
Degree: Master
Type: Thesis
Country: China
Candidate: R Jing
Full Text: PDF
GTID: 2248330395997457
Subject: Network and information security

Abstract/Summary:
This thesis, supported by a project of the National Natural Science Foundation of China, investigates cloud computing storage platforms, taking the distributed file system (DFS) as the research platform. Distributed file systems such as HDFS, GPFS and Lustre are now widely used for large-scale data processing. A DFS is responsible for distributed data storage and data management, and provides high-throughput data access. In addition to read and write operations, it also performs data checking: checksums are applied when data is read and written, which guarantees data integrity, since blocks may be corrupted by storage hardware faults, network errors or software defects. For large-scale data processing, however, heavy computation combined with data verification places an extra burden on the distributed file system and slows down read and write rates. A complete verification scheme is therefore needed that preserves data integrity while reducing, as far as possible, the overhead that verification imposes on the system.

Lustre provides two checksum modes to ensure data integrity: an in-memory mode (while data resides in the client cache) and a wire mode (while data is transmitted over the network). GPFS guarantees data integrity through its three-layer architecture of local disks, Network Shared Disks (NSD) and the GPFS file device, and employs three availability judgment mechanisms, File System Descriptor Quorum, Node Quorum and Tiebreaker Quorum, to protect data integrity and system safety. HDFS is the core storage layer beneath Hadoop's MapReduce framework and, like Lustre and GFS, also exists as an independent distributed file system. HDFS combines CheckSum and DataBlockScanner to ensure that the data stored on DataNodes remains intact: each DataNode keeps block metadata in its local file system for CRC checking. To verify a file, the client requests checksum information from the DataNodes; each DataNode returns the MD5 of all of its block checksums, and if the request to one DataNode fails it is retried on another; finally the per-block MD5 values are assembled and the MD5 of the whole content is computed.

The main work of this thesis is as follows:
(1) Describe the research background, introduce the concept of data integrity checking in distributed file systems, and analyse the related concepts and technologies.
(2) Analyse the data integrity protection mechanisms of distributed file systems, detailing the data integrity check models of GPFS, Lustre and HDFS, with a focus on the checksum calculation, hash algorithms and data transmission in the HDFS data transfer process.
(3) Based on this groundwork, build a data integrity check model for distributed file systems, DFS-DICM.
(4) Within this model, optimize the data writing process, cache allocation and a variant of the CRC32 checksum algorithm to improve computational efficiency and enhance system performance.
(5) Use HDFS as the experimental platform for the improvements, and use its own benchmark testing tools to measure, separately, the impact of the optimizations on overall and individual performance, load balancing, the data transfer process and the CRC32 algorithm.
(6) Compare and analyse the experimental results, draw the conclusions of the thesis, and outline further work.
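As a minimal sketch of the kind of chunked checksumming described above, the Java example below computes a CRC32 value per fixed-size chunk of a block and then an MD5 digest over the concatenated checksums, analogous in spirit to the HDFS verification flow summarized in this abstract. The chunk size, class and method names are illustrative assumptions, not the thesis's DFS-DICM implementation or Hadoop's actual API.

import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.util.zip.CRC32;

// Illustrative sketch only: per-chunk CRC32 plus an MD5 summary over the checksums.
public class ChunkedChecksumSketch {

    // Assumed chunk size for illustration (HDFS commonly checksums data in small fixed-size chunks).
    static final int BYTES_PER_CHECKSUM = 512;

    // Compute one CRC32 value per chunk of the block.
    static long[] chunkChecksums(byte[] block) {
        int chunks = (block.length + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
        long[] sums = new long[chunks];
        CRC32 crc = new CRC32();
        for (int i = 0; i < chunks; i++) {
            int off = i * BYTES_PER_CHECKSUM;
            int len = Math.min(BYTES_PER_CHECKSUM, block.length - off);
            crc.reset();
            crc.update(block, off, len);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    // Summarize a block by hashing its per-chunk checksums
    // (loosely analogous to the MD5-over-checksums aggregation described above).
    static byte[] blockDigest(long[] sums) throws Exception {
        ByteBuffer buf = ByteBuffer.allocate(sums.length * Long.BYTES);
        for (long s : sums) buf.putLong(s);
        return MessageDigest.getInstance("MD5").digest(buf.array());
    }

    public static void main(String[] args) throws Exception {
        byte[] block = "example block contents".getBytes("UTF-8");
        long[] sums = chunkChecksums(block);
        byte[] digest = blockDigest(sums);
        System.out.printf("chunks=%d, digest bytes=%d%n", sums.length, digest.length);
    }
}

On read, the same per-chunk checksums would be recomputed and compared against the stored values, so that corruption introduced by storage devices, the network or software defects can be detected at chunk granularity.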
Keywords/Search Tags: Cloud Computing, Distributed File System, HDFS, Data Integrity, CRC