
Sparse Indexing For File-Level De-duplication

Posted on: 2013-10-22
Degree: Master
Type: Thesis
Country: China
Candidate: Y D Shang
Full Text: PDF
GTID: 2268330422474305
Subject: Computer Science and Technology
Abstract/Summary:
Data de-duplication is a research hotspot in the storage domain: it both increases storage efficiency and decreases the bandwidth needed to transfer data. Today the technology is widely used in data backup systems, archival storage systems, and remote disaster-recovery systems. Enterprise data volumes are exploding as organizations collect and store increasing amounts of information, both for their own use and to satisfy government regulations, yet much of the data in storage is duplicated. Cloud-computing data centers in particular hold masses of virtual machine images containing duplicate data; de-duplication can eliminate this redundancy, reducing storage needs and costs and speeding up the dispatch of virtual machine images.

A significant challenge is that we cannot afford enough RAM to hold an index of all stored data during duplicate identification, and thus may be forced to access on-disk indexes for every incoming fingerprint. To solve this problem, Data Domain exploits chunk locality both for index caching and for laying out chunks on disk: using Bloom filters, Stream-Informed Segment Layout, and Locality Preserved Caching, it avoids a large number of disk I/Os. Sparse Indexing is another way of using chunk locality to attack the same problem. While these existing techniques have been studied in depth, they are not suited to file-level de-duplication.

We propose a file-level de-duplication method for virtual machine disk images based on random sampling. The in-memory index holds not all stored file indexes but only a random sample of them. During de-duplication we exploit the file locality of virtual machine disk images: if a file's fingerprint is found in memory, we assume that the stored directory containing that file holds further duplicates, so we de-duplicate against only that directory's index instead of against all file indexes on disk. This reduces the memory needed for effective de-duplication while requiring only a few seeks per directory. We implemented the method and evaluated its impact on the de-duplication ratio in several configurations: it maintains a good de-duplication ratio while shrinking the index memory to 1/10 of its original size, avoiding the performance loss of the disk-access bottleneck when memory is scarce.

To obtain a better de-duplication ratio, we propose a second duplicate-elimination method based on directory partitioning. We first divide every virtual machine disk image into directory partitions of roughly equal size, then choose sample files from each partition in two ways: by random sampling and by sampling based on Broder's theory. The sampled file fingerprints are used to build the sparse index in memory. We implemented this de-duplication process and analyzed the factors that influence the de-duplication ratio, including the sample rate, the size of the directory partitions, and the choice between random and Broder-theory-based sampling, and we compared the best de-duplication ratios achieved by the random-sample and directory-partition methods.
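The random-sample scheme can be illustrated with a short sketch. The Python below is a minimal illustration, not the thesis's actual implementation: the 10% sample rate, the whole-file SHA-1 fingerprints, and the `load_directory_index(dir_id)` helper that reads one directory's full index from disk are all assumptions made for the example.

```python
import hashlib

SAMPLE_RATE = 0.1  # keep roughly 1 in 10 stored fingerprints in RAM

# sparse in-memory index: sampled file fingerprint -> stored directory id
sparse_index: dict[str, str] = {}

def fingerprint(data: bytes) -> str:
    # File-level de-duplication: one hash per whole file.
    return hashlib.sha1(data).hexdigest()

def is_sampled(fp: str) -> bool:
    # Deterministic sampling by fingerprint value, so a given file
    # is always either in the sparse index or never in it.
    return int(fp[:8], 16) < SAMPLE_RATE * 0x100000000

def store_directory(dir_id: str, files: dict) -> None:
    # After new files are stored, publish only the sampled fingerprints.
    for data in files.values():
        fp = fingerprint(data)
        if is_sampled(fp):
            sparse_index[fp] = dir_id

def dedup_directory(incoming: dict, load_directory_index) -> set:
    """Return the names of duplicate files in one incoming directory."""
    fps = {name: fingerprint(data) for name, data in incoming.items()}
    for fp in fps.values():
        hit_dir = sparse_index.get(fp)
        if hit_dir is not None:
            # File locality: one sampled hit suggests the rest of this
            # directory is stored too, so a few seeks fetch the single
            # on-disk index we need instead of scanning all indexes.
            full_index = load_directory_index(hit_dir)
            return {name for name, f in fps.items() if f in full_index}
    return set()  # no sampled hit: treat files as new (ratio/RAM trade)
```

A miss in the sparse index may let a duplicate slip through; this is the trade-off that buys the 1/10 memory footprint.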
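The Broder-theory-based sampling can likewise be sketched. Broder's min-wise hashing result says that the probability two sets share their minimum hash value equals their Jaccard similarity, so taking the files with the smallest hash values as a partition's samples makes a sparse-index hit likely exactly when a similar stored partition exists. The sample size `k` and the hash choice below are illustrative assumptions.

```python
import hashlib
import heapq

def min_hash_sample(fingerprints: list, k: int = 8) -> list:
    """Pick the k fingerprints with the smallest hash values.

    By Broder's min-wise hashing result, partitions with many files
    in common will, with high probability, share their smallest-hash
    fingerprints, so similar partitions tend to produce the same
    samples for the in-memory sparse index.
    """
    def h(fp: str) -> int:
        return int(hashlib.md5(fp.encode()).hexdigest(), 16)

    return heapq.nsmallest(k, fingerprints, key=h)
```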
Both theoretical analysis and experimental results show that directory-partition-based de-duplication can improve the performance of de-duplication storage systems with a modest amount of physical memory.

To overcome the limited scalability of a centralized de-duplication system, we propose a distributed de-duplication system. To distribute and parallelize de-duplication across multiple independent backup nodes, we introduce a directory-partition routing method, distributed data storage, and a data-migration method for the distributed environment. We then analyze the system's characteristics, its practicality, and its impact on de-duplication performance. Our strategy avoids communication between backup nodes, keeps the nodes independent of each other, and distributes objects so as to maximize the scalability of the de-duplication system.
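One way to realize directory-partition routing without inter-node communication is to compute the destination node purely from a partition's own samples, so every client reaches the same decision independently. The sketch below assumes this approach; the node count and the choice of the minimum sample as the partition's representative are illustrative, not taken from the thesis.

```python
import hashlib

NUM_NODES = 4  # illustrative cluster size

def route_partition(sample_fps: list) -> int:
    """Map a directory partition to a backup node id.

    Routing on the partition's smallest sample fingerprint means
    partitions that share files (and therefore, with high probability,
    their minimum sample) land on the same node, so duplicates are
    detected locally and backup nodes never need to talk to each other.
    """
    representative = min(sample_fps)
    digest = hashlib.md5(representative.encode()).hexdigest()
    return int(digest, 16) % NUM_NODES
```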
Keywords/Search Tags: Data De-duplication, Sparse Indexing, Virtual Machine Image, File Duplicate Locality