Research and Optimization of Data Storage Under HDFS

Posted on: 2014-01-09
Degree: Master
Type: Thesis
Country: China
Candidate: B Zhang
Full Text: PDF
GTID: 2248330398457669
Subject: Computer application technology
Abstract/Summary:
In recent years, cloud computing has been widely studied and applied, quickly becoming one of the most popular topics in the computer field. Cloud storage is a newer concept that extends and develops cloud computing, and HDFS, the storage system of the Hadoop framework, is its best-known representative. Studies have found that networks contain large amounts of duplicate data, and storing such duplicates repeatedly wastes a great deal of space. Small files are also extremely numerous and frequently requested, and because every request in an HDFS system is processed by the single NameNode, overall system performance drops sharply.

First, the thesis presents the system architecture and implementation technology of Hadoop in depth, introduces the related deduplication technologies, and analyzes the shortcomings of HDFS when handling large numbers of small files, providing the theoretical basis for the research that follows.

Building on the traditional HDFS architecture, the thesis proposes a new HDFS architecture and designs its metadata management and file operation processes. Corresponding processing strategies are designed for the large volumes of duplicate data and small files in the network. The main research content and innovations of the thesis are as follows:

(1) A new HDFS architecture is proposed on the basis of the traditional one, adding a NameNode to each rack that is responsible for handling that rack's affairs. The metadata caching and recovery mechanisms of the primary NameNode and the rack NameNodes are analyzed, and the metadata acquisition process for file operations is redesigned (a minimal lookup sketch follows this abstract).

(2) For the problem of duplicate data, the thesis adopts a dual-factor detection scheme. A keyword extraction strategy is designed first and the extraction results are hashed; on this basis, text similarity matching techniques are combined to complete the detection of duplicate data (see the second sketch below). This strategy avoids the drawbacks of fixed-length block deduplication and makes the duplicate judgment more intelligent, ultimately saving storage space while improving the accuracy and rigor of deduplication.

(3) For small file processing, the thesis proposes merging small files and analyzes the merge structure, cache contents, and metadata update mechanism; the read, write, and delete operation flows for small files are analyzed and designed in detail (see the index sketch below). Because small files are merged, system storage space is saved, and because the rack NameNode completes most of the request processing for its rack, the burden on the primary NameNode is effectively relieved, further optimizing system performance.

Finally, corresponding simulation experiments are carried out according to the design scheme. The experimental results show improvements, to varying degrees, in deduplication accuracy and rigor, small file I/O speed, NameNode memory usage, and CPU utilization, demonstrating the effectiveness and soundness of the design.
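The two-level metadata lookup in contribution (1) is described only at the design level. The following minimal Java sketch illustrates the intended flow, with a rack NameNode answering rack-local requests from its own cache and falling back to the primary NameNode on a miss; all class and method names here are hypothetical, not the thesis's actual interfaces.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch of the two-level metadata lookup in (1): each rack
    // NameNode caches metadata for files stored on its rack and falls back to
    // the primary NameNode on a cache miss.
    public class RackNameNode {
        private final Map<String, String> localMetadata = new HashMap<>(); // path -> block locations
        private final Map<String, String> primaryMetadata;                 // stands in for the primary NameNode

        public RackNameNode(Map<String, String> primaryMetadata) {
            this.primaryMetadata = primaryMetadata;
        }

        public String lookup(String path) {
            // Serve rack-local requests from the cache so the primary
            // NameNode only sees cross-rack traffic and cache misses.
            String meta = localMetadata.get(path);
            if (meta == null) {
                meta = primaryMetadata.get(path);  // miss: ask the primary NameNode
                if (meta != null) {
                    localMetadata.put(path, meta); // populate the rack-level cache
                }
            }
            return meta;
        }
    }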
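The dual-factor scheme in contribution (2) can be pictured as a cheap keyword-hash filter followed by a more expensive similarity check. The sketch below assumes a naive frequency-based keyword extractor, a SHA-1 fingerprint, and Jaccard similarity with a 0.8 threshold; the thesis's actual extraction strategy, similarity measure, and threshold are not given here, so all of these are assumptions.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;
    import java.util.TreeSet;
    import java.util.stream.Collectors;

    public class DuplicateDetector {
        private static final int TOP_KEYWORDS = 10;       // assumed, not from the thesis
        private static final double SIM_THRESHOLD = 0.8;  // assumed, not from the thesis

        // Lowercased word tokens of the full text.
        static Set<String> tokens(String text) {
            Set<String> t = new HashSet<>();
            for (String w : text.toLowerCase().split("\\W+"))
                if (!w.isEmpty()) t.add(w);
            return t;
        }

        // Naive keyword extraction: the TOP_KEYWORDS most frequent words,
        // returned in sorted order so the fingerprint below is stable.
        static Set<String> extractKeywords(String text) {
            Map<String, Integer> freq = new HashMap<>();
            for (String w : text.toLowerCase().split("\\W+"))
                if (!w.isEmpty()) freq.merge(w, 1, Integer::sum);
            return freq.entrySet().stream()
                    .sorted((a, b) -> b.getValue() - a.getValue())
                    .limit(TOP_KEYWORDS)
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toCollection(TreeSet::new));
        }

        // Factor 1: a SHA-1 fingerprint of the keyword set.
        static String keywordHash(Set<String> keywords) throws Exception {
            byte[] d = MessageDigest.getInstance("SHA-1")
                    .digest(String.join(",", keywords).getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : d) sb.append(String.format("%02x", b));
            return sb.toString();
        }

        // Factor 2: Jaccard similarity over the full token sets.
        static double jaccard(Set<String> a, Set<String> b) {
            Set<String> inter = new HashSet<>(a);
            inter.retainAll(b);
            Set<String> union = new HashSet<>(a);
            union.addAll(b);
            return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
        }

        public static boolean isDuplicate(String candidate, String stored) throws Exception {
            // Cheap filter: only texts whose keyword fingerprints collide go on
            // to the more expensive full-text similarity comparison.
            if (!keywordHash(extractKeywords(candidate))
                    .equals(keywordHash(extractKeywords(stored)))) return false;
            return jaccard(tokens(candidate), tokens(stored)) >= SIM_THRESHOLD;
        }
    }

Ordering the factors this way means most non-duplicate pairs are rejected by a single hash comparison, and the costlier similarity computation only runs on likely matches.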
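The small-file merging in contribution (3) amounts to packing many small files into one large HDFS file and tracking each small file with a per-file (offset, length) index entry, so the NameNode stores one entry per merged file instead of one per small file. A minimal sketch of such an index follows; the record layout and method names are illustrative rather than the thesis's actual design.

    import java.util.HashMap;
    import java.util.Map;

    public class SmallFileIndex {
        // Where one small file lives inside a merged HDFS file.
        public static class Entry {
            public final String mergedFile;
            public final long offset;
            public final long length;
            public Entry(String mergedFile, long offset, long length) {
                this.mergedFile = mergedFile;
                this.offset = offset;
                this.length = length;
            }
        }

        private final Map<String, Entry> index = new HashMap<>();

        // Write path: the small file's bytes are appended to a merged file,
        // and only this index entry is added to the metadata.
        public void put(String smallFile, String mergedFile, long offset, long length) {
            index.put(smallFile, new Entry(mergedFile, offset, length));
        }

        // Read path: resolve the small file to a ranged read of the merged file.
        public Entry locate(String smallFile) {
            return index.get(smallFile);
        }

        // Delete path: drop the index entry; space inside the merged file would
        // be reclaimed later by a background compaction pass (not shown).
        public void delete(String smallFile) {
            index.remove(smallFile);
        }
    }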
Keywords/Search Tags: Cloud Storage, Hadoop, HDFS, Distributed, Data Deduplication