
Research On Data De-duplication Technology In Parallel Systems

Posted on: 2014-03-29
Degree: Doctor
Type: Dissertation
Country: China
Candidate: R Zhu
Full Text: PDF
GTID: 1228330425973321
Subject: Computer system architecture
Abstract/Summary:
With the development of the Internet, more and more network-based applications generate massive data, now reaching several petabytes. Because finding and removing redundant data makes more efficient use of storage space, data de-duplication has become an important topic in network storage research. De-duplication is both computation-intensive and I/O-intensive: as the volume of data to be backed up grows, its performance is dominated by the hash-based chunking computation and the on-disk index lookups. With the spread of multi-core and many-core processors, it is therefore important to improve the performance of both the computation and the disk-index access in parallel environments.

Most recent research on parallel data de-duplication relies on GPU-accelerated computation and multi-threaded disk index access. However, as parallelism increases, both approaches encounter performance bottlenecks that limit their scalability. This dissertation therefore analyzes the dominant cost factors of the two methods using corresponding performance models, proposes two optimized solutions that alleviate their bottlenecks, and validates the improvements in experiments on real-world data sets.

When analyzing the performance of parallel systems, it is important to model the main aspects of their operation in order to locate the bottlenecks. Taking into account the main procedure, the concurrency, and the sharing of and contention for resources in the GPU-accelerated and parallel disk-index-access methods, this dissertation proposes two performance models based on Stochastic Petri nets. By computing the utilization ratios of the models, it deduces the performance bottleneck of each method.
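As background, the chunk-then-look-up pipeline whose two costly stages (fingerprint computation and index lookup) the dissertation targets can be sketched as below. The fixed-size chunking, SHA-256 fingerprints, and in-memory index are illustrative assumptions only; the dissertation concerns on-disk indexes and GPU-accelerated hashing.

```python
import hashlib

CHUNK_SIZE = 4096  # illustrative fixed-size chunking; real systems often use content-defined chunking

def deduplicate(data: bytes, index: dict) -> tuple[int, int]:
    """Split data into chunks, fingerprint each, and store only new chunks.

    Returns (total_chunks, unique_chunks_stored).
    """
    total = unique = 0
    for off in range(0, len(data), CHUNK_SIZE):
        chunk = data[off:off + CHUNK_SIZE]
        fp = hashlib.sha256(chunk).hexdigest()  # hash computation: the compute-intensive stage
        total += 1
        if fp not in index:                     # index lookup: the I/O-intensive stage on disk
            index[fp] = chunk                   # store only previously unseen chunks
            unique += 1
    return total, unique

index = {}
data = b"abcd" * 4096                           # highly redundant input: 4 identical chunks
total, unique = deduplicate(data, index)        # → total == 4, unique == 1
```

In this toy form the index fits in memory; at petabyte scale it must live on disk, which is exactly why index-access parallelism becomes the bottleneck the dissertation studies.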
It then proposes corresponding improvements to alleviate these bottlenecks and evaluates them on real-world data sets.

In the GPU-accelerated de-duplication method, the data-transfer latency between host RAM and GPU memory is the main performance bottleneck. This work found that repeatedly transferring the same data to the GPU produces redundant transfer latency. To remove it, the dissertation optimizes the procedure of the traditional GPU-accelerated method; in the experiments, the optimized method achieves better performance.

Because only unique data may be stored, the parallel disk-index-access method needs a synchronization mechanism to avoid conflicts between index-accessing threads. The traditional lock-based mechanism incurs heavy consistency overhead as the number of index-accessing threads grows. To alleviate this bottleneck, the dissertation proposes a DHT-based index-access mechanism in which each thread accesses its own sub-index, selected by the suffix of a chunk's fingerprint. In the experiments, this mechanism also achieves better performance.
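The suffix-based partitioning idea can be illustrated as follows. This is a minimal single-threaded sketch of the routing rule only; the SHA-256 fingerprints, the hex-suffix shard function, and the shard count of 16 are assumptions for illustration, not the dissertation's actual parameters.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 16  # illustrative: one sub-index (and one worker thread) per suffix value

def shard_of(fingerprint: str) -> int:
    # Route a fingerprint to a sub-index by its last hex digit. Because each
    # shard is owned by exactly one thread, threads never touch the same
    # sub-index and need no lock-based synchronization.
    return int(fingerprint[-1], 16) % NUM_SHARDS

shards = defaultdict(dict)  # shard id -> sub-index {fingerprint: chunk}
for chunk in (b"alpha", b"beta", b"alpha"):
    fp = hashlib.sha256(chunk).hexdigest()
    shards[shard_of(fp)].setdefault(fp, chunk)  # duplicate "alpha" is stored once

total_entries = sum(len(sub) for sub in shards.values())  # → 2 unique chunks
```

Since the fingerprint already determines the shard, the routing itself requires no coordination between threads, which is what lets this scheme scale where a single lock-protected index does not.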
Keywords/Search Tags:Data De-duplication, Parallel, Multi-core, Stochastic Petri Net