
Research And System Implementation Of Deduplication Algorithm Based On Distributed Storage

Posted on: 2023-08-09
Degree: Master
Type: Thesis
Country: China
Candidate: M Y Hao
Full Text: PDF
GTID: 2558306914463674
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid growth of network data, the storage pressure on data centers increases accordingly. As an important data reduction technique, deduplication plays a crucial role in saving storage space and reducing storage costs. Deduplication in distributed storage systems has become a research hotspot because of its parallelism, capacity for large data volumes, and high overall throughput. However, deduplication in a distributed environment still faces two problems. In the data routing stage, finding a target node that yields a good deduplication ratio incurs high computation and communication costs. During deduplication on a single node, the growing volume of data means that limited memory cannot hold all data fingerprints, which creates a disk-lookup bottleneck. To address these two problems, this paper proposes a feature-aware similarity-based stateful routing method and a deduplication method based on historical-information feedback, and designs and implements a distributed deduplication system. The main contributions of this paper are as follows:

(1) To reduce the high system overhead of the data routing stage, this paper proposes a similarity-based stateful routing strategy built on feature awareness. By extracting data features and data-distribution features, the method constructs a feature-aware node selection strategy that lowers the cost of computing data similarity. It further applies super-chunk and handprint techniques to perform stateful data routing, which keeps the system load balanced while maintaining a high deduplication ratio (an illustrative routing sketch follows the abstract).

(2) To address the disk bottleneck caused by chunk-fingerprint lookups during single-node deduplication, a deduplication method based on historical-information feedback is proposed, and an efficient in-memory fingerprint index structure is designed to record how much each data chunk has contributed to past deduplication. The method uses the Thompson Sampling algorithm to learn the relationship between a chunk's historical contribution and the benefit of prefetching it, and improves prefetch accuracy through dynamic prefetching and caching, thereby raising the system's deduplication ratio (an illustrative prefetch-decision sketch also follows the abstract).

(3) A distributed deduplication system is designed and implemented. The system consists of a client and a server and supports file upload, user management, file management, data deduplication, and storage. The deduplication function uses the proposed data routing strategy and the improved fingerprint index structure to detect and eliminate redundant data, thereby saving storage space. Test results show that the system deduplicates data effectively.
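The similarity-based stateful routing in contribution (1) can be pictured with a short sketch. The thesis does not include code, so the snippet below is only a hypothetical illustration under assumed names (`handprint`, `route_super_chunk`, the `node_index` and `node_load` maps, and the `alpha` weight): the k smallest chunk fingerprints of a super-chunk serve as its handprint, each node is scored by how many handprint fingerprints it already stores, and that similarity is traded off against the node's current load to keep the system balanced.

```python
import hashlib

def chunk_fingerprint(chunk: bytes) -> str:
    """SHA-1 fingerprint of a data chunk (a common choice in dedup systems)."""
    return hashlib.sha1(chunk).hexdigest()

def handprint(chunk_fps, k=8):
    """Handprint of a super-chunk: its k smallest chunk fingerprints."""
    return set(sorted(chunk_fps)[:k])

def route_super_chunk(chunk_fps, node_index, node_load, alpha=0.7):
    """Score each node by handprint overlap (similarity) minus a load
    penalty, and return the id of the best-scoring node."""
    hp = handprint(chunk_fps)
    best_node, best_score = None, float("-inf")
    for node_id, stored_fps in node_index.items():
        similarity = len(hp & stored_fps) / max(len(hp), 1)
        score = alpha * similarity - (1 - alpha) * node_load.get(node_id, 0.0)
        if score > best_score:
            best_node, best_score = node_id, score
    return best_node

# Toy example: node "B" already stores the super-chunk's handprint.
fps = [chunk_fingerprint(b"block-%d" % i) for i in range(16)]
node_index = {"A": set(), "B": handprint(fps)}
node_load = {"A": 0.2, "B": 0.3}
print(route_super_chunk(fps, node_index, node_load))  # prints "B"
```

Because only the handprint fingerprints are compared rather than every fingerprint in the super-chunk, the per-node similarity computation and communication stay small, which is consistent with the routing-overhead goal stated above.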
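Contribution (2) pairs historical deduplication feedback with Thompson Sampling to decide which fingerprint groups are worth prefetching into memory. The sketch below is likewise only an assumed illustration, not the thesis's actual index structure: each fingerprint container keeps a Beta posterior over "prefetching this container produces duplicate hits", a container is prefetched when a sample from that posterior clears a threshold, and the posterior is updated from the hits and misses observed afterwards.

```python
import random
from collections import defaultdict

class ThompsonPrefetcher:
    """Thompson Sampling sketch for prefetch decisions (illustrative only).
    Each fingerprint container keeps a Beta(alpha, beta) posterior over the
    probability that prefetching it yields duplicate hits."""

    def __init__(self, threshold=0.5):
        self.alpha = defaultdict(lambda: 1.0)  # prior successes + 1
        self.beta = defaultdict(lambda: 1.0)   # prior failures + 1
        self.threshold = threshold

    def should_prefetch(self, container_id) -> bool:
        """Sample from the container's posterior and compare to the threshold."""
        sample = random.betavariate(self.alpha[container_id], self.beta[container_id])
        return sample >= self.threshold

    def feedback(self, container_id, hits: int, misses: int):
        """Update the posterior with the hits/misses observed after a prefetch."""
        self.alpha[container_id] += hits
        self.beta[container_id] += misses

# Example: container "c1" historically produces many hits, "c2" few.
prefetcher = ThompsonPrefetcher()
prefetcher.feedback("c1", hits=30, misses=2)
prefetcher.feedback("c2", hits=1, misses=25)
print(prefetcher.should_prefetch("c1"))  # almost always True
print(prefetcher.should_prefetch("c2"))  # almost always False
```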
Keywords/Search Tags:Data Deduplication, Data Routing, Disk Bottleneck, Feature Aware, Data Prefetching