
Research On Data Organization For Data De-duplication System

Posted on: 2016-09-24
Degree: Doctor
Type: Dissertation
Country: China
Candidate: F Yan
Full Text: PDF
GTID: 1108330503955326
Subject: Computer application technology
Abstract/Summary:
De-duplication plays an increasingly important role in fast-growing data storage systems. It is widely used in VTL (Virtual Tape Library) systems, data backup systems, data archiving systems, and so on. The core of de-duplication is to partition files or data streams into contiguous data chunks, apply a hash function (such as SHA-1) to each chunk to generate its summary information (called the fingerprint), and identify duplicate chunks by comparing their fingerprints against those of already-stored chunks. When a duplicate chunk is detected, only its metadata is stored, thereby reducing storage space consumption.

Although much research has been done on data de-duplication systems, there is still ample room for optimization in data organization. Specifically, given the characteristic data access patterns of de-duplication systems, more efficient storage architectures and data management policies are needed to maximize storage potential and reduce storage energy consumption. This dissertation studies data organization in de-duplication systems, object-level de-duplication, metadata storage strategies, and data restoration methods. The main innovations are as follows:

(1) A chunk-oriented cross-grouping data organization is proposed, which exploits the sequential data access pattern of de-duplication systems to reduce storage power consumption. De-duplication systems generally use RAID (Redundant Array of Independent Disks) for storage and data protection; de-duplicated data are uniformly distributed across the disk array, yet only a small number of disks is needed to provide sufficient I/O bandwidth for sequential access. This dissertation designs a RAID-5 cross-grouping data organization method and a corresponding energy-saving disk scheduling algorithm.
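The chunk-and-fingerprint pipeline described at the start of this abstract can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: it assumes fixed-size chunking for brevity, and the `dedup_store`/`restore` names are hypothetical.

```python
import hashlib

def dedup_store(stream: bytes, chunk_size: int = 4096):
    """Split a byte stream into fixed-size chunks, fingerprint each with
    SHA-1, and keep only the first copy of every distinct chunk."""
    store = {}   # fingerprint -> chunk data (unique chunks only)
    recipe = []  # ordered fingerprints needed to rebuild the stream
    for off in range(0, len(stream), chunk_size):
        chunk = stream[off:off + chunk_size]
        fp = hashlib.sha1(chunk).hexdigest()
        if fp not in store:
            store[fp] = chunk   # new chunk: store the data itself
        recipe.append(fp)       # duplicates contribute metadata only
    return store, recipe

def restore(store, recipe):
    """Rebuild the original stream from the unique chunks and the recipe."""
    return b"".join(store[fp] for fp in recipe)
```

With input `b"A"*8192 + b"B"*4096 + b"A"*4096`, only two unique chunks are stored while the recipe lists four, and `restore` round-trips the original bytes.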
It uses parallel disk groups and can adapt to different throughput requirements by adjusting the horizontal group size, while reducing parity-disk switching frequency by choosing a reasonable vertical group size. When access requests concentrate on one horizontal disk group, the other groups can switch into standby state. The cross-grouping layout was implemented on top of the Linux MD (Multiple Device driver) module and verified to reduce power consumption by 26% under a 10-disk, 3-group storage configuration.

(2) An object-storage data organization for an OpenXML compound-file de-duplication system is proposed to realize energy-efficient object storage. This dissertation designs a RAID-4 asymmetric-grouping object-storage data organization method and a disk-group adjustment algorithm. The number of disks in each group can be adjusted as needed; two disk groups work in parallel, storing volatile and non-volatile objects respectively. Predictive mechanisms realize the grouping adjustments: the equalization adjustment algorithm calculates an adjustment factor from the system's I/O performance requirements, while the proportional adjustment considers the storage needs of different object types. The asymmetric-grouping data organization is well suited to object storage and can be adjusted as the backup load varies. In a storage configuration with 10 disks and 3 initial groups, equalization adjustment and proportional adjustment reduce power consumption by about 22% and 27%, respectively.

(3) A hot/cold metadata storage organization strategy based on metadata access frequency is proposed to improve metadata access efficiency. To reduce on-disk index accesses, most researchers focus on fingerprint lookup techniques, ignoring the energy consumption introduced by index lookup and metadata storage. Metadata is divided into two categories, 'popular' (hot) and 'cold'.
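The popular/cold split can be illustrated with a frequency-based classifier. This is a sketch under assumed semantics: the abstract does not state the classification rule, so the top-fraction threshold (`hot_fraction`) and the function name are illustrative, not the dissertation's actual policy.

```python
from collections import Counter

def classify_metadata(access_log, hot_fraction=0.2):
    """Split fingerprints into 'popular' (hot) and 'cold' sets by access
    frequency. hot_fraction is an assumed tuning knob: the most-accessed
    fraction of fingerprints is treated as popular."""
    counts = Counter(access_log)
    ranked = [fp for fp, _ in counts.most_common()]
    n_hot = max(1, int(len(ranked) * hot_fraction))
    hot = set(ranked[:n_hot])
    cold = set(ranked[n_hot:])
    return hot, cold
```

Popular entries would then be placed in the cross-grouping store behind the B+ tree index, while cold entries go to the append-style sequential store.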
Popular metadata storage uses the cross-grouping data organization. The index structure is based on a B+ tree; the fingerprint table is split into sub-tables whose size is controlled by the storage sub-block size, and metadata entries are stored in the order in which chunks arrive in the data stream. Cold metadata is written to its disk group sequentially; that is, spatial locality is preserved by using an append-style file structure when storing cold metadata. Popular and cold metadata are stored separately; in a configuration with a 5-disk, 2-group popular metadata store and a 3-disk, 3-group cold metadata store, this strategy reduces metadata storage power consumption by 21%.

(4) A data replication and restoration strategy based on data storage location is proposed to improve data restore efficiency. Storing selected duplicate data can effectively improve restore speed. Most previous studies rely on a high repeat-access rate to decide which data to replicate, yet the chunks that make up the latest backup are often scattered throughout the storage system, so restoring de-duplicated data causes many random disk reads. This dissertation designs a selective data replication and restoration strategy based on storage location. The cross-grouping data organization is further partitioned into regions; an access distance matrix is introduced so that chunk access distances reflect the state of each storage region, and the optimal regions to read from can then be chosen during restoration. Under the cross-grouping data organization (10 disks, 3 groups), this optimization improves restoration performance by 22% at the cost of a 7.4% reduction in de-duplication rate. For the asymmetric-grouping object-storage data organization, it improves restoration performance by about 11% compared with ordinary RAID-4.
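The idea of turning scattered restore reads into region-local sequential passes can be sketched as below. This is a simplified stand-in for the access-distance-matrix mechanism: the region-densest-first ordering and the `plan_restore_reads`/`region_of` names are assumptions for illustration, not the dissertation's algorithm.

```python
from collections import defaultdict

def plan_restore_reads(recipe, region_of):
    """Group the chunks of one restore request by storage region so each
    region can be read in a single sequential pass instead of many seeks.
    recipe: ordered fingerprints to restore; region_of: fingerprint -> region id
    (the layout mapping is assumed to be available from the chunk metadata)."""
    per_region = defaultdict(list)
    for idx, fp in enumerate(recipe):
        per_region[region_of[fp]].append(idx)
    # Visit denser regions first: an assumed heuristic standing in for the
    # access-distance-based region selection.
    order = sorted(per_region, key=lambda r: len(per_region[r]), reverse=True)
    return [(r, per_region[r]) for r in order]
```

For a four-chunk recipe in which three chunks live in region 0 and one in region 1, the plan reads region 0 once sequentially before touching region 1.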
Keywords/Search Tags: data de-duplication, data organization, metadata, data recovery, energy-saving storage subsystem