
Research And System Implementation Of Deduplication Algorithm Based On Distributed Storage

Posted on: 2023-08-09
Degree: Master
Type: Thesis
Country: China
Candidate: M Y Hao
Full Text: PDF
GTID: 2558306914463674
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid growth of network data, the storage pressure on data centers increases accordingly. As an important data reduction technique, deduplication plays a crucial role in saving storage space and reducing storage costs. Deduplication in distributed storage systems has become a research hotspot because of its parallelism, capacity for large data volumes, and high overall throughput. However, deduplication in a distributed environment still faces two problems. In the data routing stage, finding a target node that yields a good deduplication ratio incurs high computation and communication costs. During deduplication on a single node, the growing volume of data means that limited memory cannot hold all data fingerprints, which creates a disk-lookup bottleneck. To address these two problems, this paper proposes a feature-aware similarity-based stateful routing method and a deduplication method based on historical-information feedback, and designs and implements a distributed deduplication system. The main contributions of this paper are as follows:

(1) To reduce the high system overhead of the data routing stage, this paper proposes a similarity-based stateful routing strategy built on feature awareness. By extracting data features and data-distribution features, the method constructs a feature-aware node selection strategy that lowers the cost of computing data similarity. It further applies super-chunk and handprint techniques to perform stateful data routing, which keeps the system load balanced while maintaining a high deduplication ratio (an illustrative routing sketch follows the abstract).

(2) To address the disk bottleneck caused by chunk-fingerprint lookups during single-node deduplication, a deduplication method based on historical-information feedback is proposed, and an efficient in-memory fingerprint index structure is designed to record how much each data chunk has contributed to past deduplication. The method uses the Thompson Sampling algorithm to learn the relationship between a chunk's historical contribution and the benefit of prefetching it, and improves prefetch accuracy through dynamic prefetching and caching, thereby raising the system's deduplication ratio (an illustrative prefetch-decision sketch also follows the abstract).

(3) A distributed deduplication system is designed and implemented. The system consists of a client and a server and supports file upload, user management, file management, data deduplication, and storage. The deduplication function uses the proposed data routing strategy and the improved fingerprint index structure to detect and eliminate redundant data, thereby saving storage space. Test results show that the system deduplicates data effectively.
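The similarity-based stateful routing in contribution (1) can be pictured with a short sketch. The thesis does not include code, so the snippet below is only a hypothetical illustration under assumed names (`handprint`, `route_super_chunk`, the `node_index` and `node_load` maps, and the `alpha` weight): the k smallest chunk fingerprints of a super-chunk serve as its handprint, each node is scored by how many handprint fingerprints it already stores, and that similarity is traded off against the node's current load to keep the system balanced.

```python
import hashlib

def chunk_fingerprint(chunk: bytes) -> str:
    """SHA-1 fingerprint of a data chunk (a common choice in dedup systems)."""
    return hashlib.sha1(chunk).hexdigest()

def handprint(chunk_fps, k=8):
    """Handprint of a super-chunk: its k smallest chunk fingerprints."""
    return set(sorted(chunk_fps)[:k])

def route_super_chunk(chunk_fps, node_index, node_load, alpha=0.7):
    """Score each node by handprint overlap (similarity) minus a load
    penalty, and return the id of the best-scoring node."""
    hp = handprint(chunk_fps)
    best_node, best_score = None, float("-inf")
    for node_id, stored_fps in node_index.items():
        similarity = len(hp & stored_fps) / max(len(hp), 1)
        score = alpha * similarity - (1 - alpha) * node_load.get(node_id, 0.0)
        if score > best_score:
            best_node, best_score = node_id, score
    return best_node

# Toy example: node "B" already stores the super-chunk's handprint.
fps = [chunk_fingerprint(b"block-%d" % i) for i in range(16)]
node_index = {"A": set(), "B": handprint(fps)}
node_load = {"A": 0.2, "B": 0.3}
print(route_super_chunk(fps, node_index, node_load))  # prints "B"
```

Because only the handprint fingerprints are compared rather than every fingerprint in the super-chunk, the per-node similarity computation and communication stay small, which is consistent with the routing-overhead goal stated above.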
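Contribution (2) pairs historical deduplication feedback with Thompson Sampling to decide which fingerprint groups are worth prefetching into memory. The sketch below is likewise only an assumed illustration, not the thesis's actual index structure: each fingerprint container keeps a Beta posterior over "prefetching this container produces duplicate hits", a container is prefetched when a sample from that posterior clears a threshold, and the posterior is updated from the hits and misses observed afterwards.

```python
import random
from collections import defaultdict

class ThompsonPrefetcher:
    """Thompson Sampling sketch for prefetch decisions (illustrative only).
    Each fingerprint container keeps a Beta(alpha, beta) posterior over the
    probability that prefetching it yields duplicate hits."""

    def __init__(self, threshold=0.5):
        self.alpha = defaultdict(lambda: 1.0)  # prior successes + 1
        self.beta = defaultdict(lambda: 1.0)   # prior failures + 1
        self.threshold = threshold

    def should_prefetch(self, container_id) -> bool:
        """Sample from the container's posterior and compare to the threshold."""
        sample = random.betavariate(self.alpha[container_id], self.beta[container_id])
        return sample >= self.threshold

    def feedback(self, container_id, hits: int, misses: int):
        """Update the posterior with the hits/misses observed after a prefetch."""
        self.alpha[container_id] += hits
        self.beta[container_id] += misses

# Example: container "c1" historically produces many hits, "c2" few.
prefetcher = ThompsonPrefetcher()
prefetcher.feedback("c1", hits=30, misses=2)
prefetcher.feedback("c2", hits=1, misses=25)
print(prefetcher.should_prefetch("c1"))  # almost always True
print(prefetcher.should_prefetch("c2"))  # almost always False
```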
Keywords/Search Tags:Data Deduplication, Data Routing, Disk Bottleneck, Feature Aware, Data Prefetching