
Research On Efficient Data Deduplication Storage Technology Based On Intelligent Analysis Of Data Stream

Posted on: 2024-01-25
Degree: Master
Type: Thesis
Country: China
Candidate: L H Li
Full Text: PDF
GTID: 2558307127960889
Subject: Computer technology
Abstract/Summary:
Data deduplication is an efficient data compression technique that can save a great deal of storage space in large-scale data storage and backup. It has been a hot topic in the storage field in recent years and is widely used in data backup, archival storage, remote disaster recovery, and similar scenarios. Because a large proportion of stored data is duplicated, deduplication not only improves storage efficiency but also reduces traffic volume and improves network efficiency. In data center storage, current deduplication techniques face two major challenges. First, block-based deduplication achieves a higher deduplication rate than file-based deduplication, but it easily causes data fragmentation, which leads to read amplification and makes efficient data recovery difficult. Second, in the index-lookup phase, the large number of data blocks requires many disk I/O operations, which creates a disk-access bottleneck. To address these problems, this thesis carefully designs the layout of data block storage, optimizes the data block index structure, and proposes the following solutions:

1) A data rewriting algorithm based on cache prediction and submodular maximization (Cache Prediction Rewriting, CPR) is presented. First, the container selection problem is modeled as submodular maximization so that the containers holding the most distinct data blocks are selected, and the concept of a read amplification level is introduced to keep overly fragmented containers out of the result set. Then, the locality between data segments is analyzed, and through cache prediction the duplicate blocks in containers that are certain to be hit by the cache are deleted, so that fewer data fragments remain and the distribution of data blocks becomes more reasonable. Finally, experiments on simulated data streams from multiple data sets verify the effectiveness of the algorithm, showing that it improves both the deduplication rate and the data recovery speed.

2) On top of the rewriting algorithm, a container-level lightweight index is designed. First, in the on-disk component, a per-container list of the data blocks stored in that container is maintained, replacing the global on-disk block index. Then, a container-level lightweight index is kept in memory: its keys are the fingerprints of sample blocks randomly drawn from each data segment, and its values are pointers to the containers that hold those sample blocks. When a data segment is processed, some of its blocks are randomly sampled; the sampled blocks help locate the containers with the strongest locality and the most similar, densest data distribution, and only the corresponding data lists are loaded for fingerprint comparison and duplicate elimination. Finally, experiments show that the proposed method achieves higher throughput than state-of-the-art methods and mitigates the disk-access bottleneck.
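To make the container-selection step of 1) concrete, the following is a minimal Python sketch of a greedy coverage heuristic with a read-amplification cap. The function name, parameters, and the exact read-amplification formula are illustrative assumptions, not the thesis's CPR algorithm as specified.

```python
# Hedged sketch of greedy container selection with a read-amplification cap.
# Names and the budget/threshold parameters are illustrative assumptions.

def select_containers(segment_fingerprints, containers, budget, max_read_amp):
    """Greedily pick up to `budget` containers that cover the most distinct
    duplicate blocks of the incoming segment (a standard greedy heuristic for
    submodular coverage), skipping containers whose read amplification
    exceeds `max_read_amp`.

    segment_fingerprints: set of block fingerprints in the incoming segment
    containers: dict mapping container_id -> set of fingerprints it stores
    """
    uncovered = set(segment_fingerprints)
    selected = []
    for _ in range(budget):
        best_id, best_gain = None, 0
        for cid, fps in containers.items():
            if cid in selected:
                continue
            overlap = fps & uncovered
            if not overlap:
                continue
            # Read amplification: container blocks loaded per block reused.
            read_amp = len(fps) / len(overlap)
            if read_amp > max_read_amp:
                continue  # too fragmented; rewrite its duplicates instead
            if len(overlap) > best_gain:
                best_id, best_gain = cid, len(overlap)
        if best_id is None:
            break
        selected.append(best_id)
        uncovered -= containers[best_id]
    # Blocks left uncovered (or found only in overly fragmented containers)
    # would be rewritten into new containers rather than referenced.
    return selected, uncovered
```

The cap on read amplification is what excludes fragmented containers from the result set; duplicates that fall only in such containers are rewritten, which is the trade-off between deduplication rate and restore speed discussed above.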
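Similarly, the container-level lightweight index of 2) can be sketched as below. The sampling rate, the in-memory/on-disk split, and the top-k routing policy are assumptions made for illustration rather than the thesis's exact design.

```python
import random
from collections import defaultdict

# Hedged sketch of a container-level sampled-fingerprint index: only sampled
# fingerprints live in memory; full per-container lists stand in for the
# on-disk component. Parameters are illustrative assumptions.

class SampledContainerIndex:
    def __init__(self, sample_rate=1 / 64):
        self.sample_rate = sample_rate
        # In-memory index: sampled fingerprint -> container ids holding it.
        self.hooks = defaultdict(set)
        # Stand-in for the on-disk lists: container id -> full fingerprint set.
        self.container_lists = {}

    def insert_container(self, container_id, fingerprints):
        """Record a container's full list 'on disk' and its samples in memory."""
        self.container_lists[container_id] = set(fingerprints)
        for fp in fingerprints:
            if random.random() < self.sample_rate:
                self.hooks[fp].add(container_id)

    def deduplicate_segment(self, segment_fingerprints):
        """Route the segment to its most similar containers, then compare
        only against those containers' full lists."""
        votes = defaultdict(int)
        for fp in segment_fingerprints:
            for cid in self.hooks.get(fp, ()):
                votes[cid] += 1
        # Load only the best-matching container lists (here: top 2).
        candidates = sorted(votes, key=votes.get, reverse=True)[:2]
        known = set().union(*(self.container_lists[c] for c in candidates)) if candidates else set()
        duplicates = [fp for fp in segment_fingerprints if fp in known]
        new_blocks = [fp for fp in segment_fingerprints if fp not in known]
        return duplicates, new_blocks
```

Because only sampled fingerprints are kept in memory and only a few candidate container lists are read per segment, lookups avoid a global block index and most of the associated disk I/O, which is the intent of the design described above.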
Keywords/Search Tags: Data deduplication technology, Data fragmentation, Data rewriting, Memory index