
Design And Optimization Of Hash Collision Resolution Technologies On WAN Deduplication

Posted on: 2023-04-07
Degree: Master
Type: Thesis
Country: China
Candidate: R Q Bao
Full Text: PDF
GTID: 2558307043974519
Subject: Computer system architecture
Abstract/Summary:
In recent years, the amount of digital data in the world has been growing explosively, and data deduplication is widely used to tackle this challenge. Its purpose is to save network bandwidth by avoiding the transmission of redundant data in wide area network (WAN) environments. WAN deduplication is mainly deployed at network edge devices (e.g., switches). When data arrives at a switch, a typical deduplication system splits the input data stream into chunks, each of which is uniquely identified and duplicate-detected by a cryptographically secure hash signature (e.g., SHA-1), also called a fingerprint. The system then removes duplicate chunks and transfers only one copy of each, thereby saving network bandwidth. However, every hash function admits potential collisions, in which two different data chunks share the same hash value. A network edge device carries thousands of TCP streams, so a traditional single-layer secure hash algorithm greatly increases the probability of a hash collision, and when a collision happens it causes unrecoverable data corruption. To address this problem, this thesis proposes the design and optimization of hash collision resolution technologies for WAN deduplication.

First, a Double-Layer hash collision resolution scheme is presented that combines MD5 and SHA-1, exploiting their comparatively high cryptographic hashing speeds. A mathematical model is also constructed to analyze the collision probability of WAN deduplication. The results show that, at a cost of less than 10% in network throughput, the Double-Layer scheme effectively decreases the probability of hash collisions in the WAN deduplication system.

Second, because the Double-Layer scheme requires additional computation and memory to reduce the collision probability, a Similarity-Aware Double-Layer hash collision resolution scheme is presented that distinguishes large files from small ones and effectively exploits similarity. It achieves better network throughput than SHA-1 with lower memory overhead.

Finally, because the above schemes scale poorly across multiple network edge devices, a Locality-based hash path division collision resolution scheme is presented. To preserve the locality of the input data stream, chunks are grouped into data segments, the segments' features are extracted, and duplicate and similar data are routed to the same network edge device, which keeps the fingerprint dictionary load well balanced.

Experimental evaluation on real-world datasets shows that the WAN deduplication system DedupProxy achieves a deduplication ratio of 79.67%, and that WAN deduplication reduces network transfer time by 65.89% compared with directly using the SFTP protocol. The Similarity-Aware Double-Layer scheme consumes only about 6.25% of the RAM required by the SHA-1-only scheme. The Locality-based hash path division scheme effectively balances the fingerprint dictionary load while saving 17.51% of the capacity required by dictionary-integration solutions.
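As background for the schemes above, the following is a minimal Python sketch of the fingerprint-based duplicate detection the abstract describes. The fixed-size chunking and the names `chunk_stream` and `Deduplicator` are illustrative assumptions, not the thesis's implementation.

```python
import hashlib

# Fixed-size chunking for simplicity; production systems often use
# content-defined chunking so that edits do not shift chunk boundaries.
CHUNK_SIZE = 8 * 1024

def chunk_stream(data: bytes, size: int = CHUNK_SIZE):
    """Split an input byte stream into chunks."""
    for off in range(0, len(data), size):
        yield data[off:off + size]

class Deduplicator:
    """Detect duplicate chunks by their SHA-1 fingerprint."""

    def __init__(self):
        self.index = set()  # the fingerprint dictionary

    def process(self, data: bytes) -> list:
        """Return only the chunks that must cross the WAN."""
        unique = []
        for chunk in chunk_stream(data):
            fp = hashlib.sha1(chunk).digest()  # the chunk's fingerprint
            if fp not in self.index:           # unseen chunk: transfer it
                self.index.add(fp)
                unique.append(chunk)
        return unique
```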
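The thesis's own collision-probability model is not reproduced in the abstract; the standard birthday-bound estimate below illustrates why the collision probability grows roughly quadratically with the number n of distinct chunks for a b-bit hash (b = 160 for SHA-1):

```latex
P_{\mathrm{collision}} \;\approx\; 1 - \exp\!\left(-\frac{n(n-1)}{2^{b+1}}\right) \;\approx\; \frac{n^{2}}{2^{b+1}} \qquad (n \ll 2^{b/2})
```

With thousands of concurrent TCP streams each contributing chunks, n grows quickly, which is the effect the abstract attributes to a traditional single-layer hash.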
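A sketch of the Double-Layer idea, under the assumption that a chunk is treated as a duplicate only when both its MD5 and SHA-1 digests match; the thesis's actual layering order and index layout may differ.

```python
import hashlib

def double_layer_fingerprint(chunk: bytes) -> bytes:
    """Concatenate two independent digests: a false duplicate now requires
    two different chunks to collide under MD5 *and* SHA-1 simultaneously."""
    md5 = hashlib.md5(chunk).digest()    # 16-byte first layer (fast)
    sha1 = hashlib.sha1(chunk).digest()  # 20-byte second layer
    return md5 + sha1                    # 36-byte combined fingerprint

def is_duplicate(chunk: bytes, index: set) -> bool:
    """Check the combined fingerprint against the fingerprint dictionary."""
    fp = double_layer_fingerprint(chunk)
    if fp in index:
        return True
    index.add(fp)
    return False
```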
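A sketch of the Locality-based routing step, assuming the segment feature is the minimum chunk fingerprint (a common similarity feature in the deduplication literature) and that devices are selected by hashing that feature; the segment size and helper names are hypothetical.

```python
import hashlib

SEGMENT_CHUNKS = 128  # chunks per segment; an illustrative value

def segments(chunks, size: int = SEGMENT_CHUNKS):
    """Group consecutive chunks into segments to preserve stream locality."""
    for i in range(0, len(chunks), size):
        yield chunks[i:i + size]

def segment_feature(segment) -> bytes:
    """Representative feature: the minimum SHA-1 fingerprint in the segment."""
    return min(hashlib.sha1(c).digest() for c in segment)

def route(segment, num_edge_devices: int) -> int:
    """Send a whole segment to one edge device, so duplicate and similar
    segments consult the same fingerprint dictionary partition."""
    feature = segment_feature(segment)
    return int.from_bytes(feature[:8], "big") % num_edge_devices
```

Because similar segments tend to share their minimum fingerprint, they hash to the same device, which is what keeps each device's fingerprint dictionary load balanced.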
Keywords/Search Tags: Redundancy Elimination, Wide Area Network Deduplication, Hash Collision Resolution, Collision Free