
Research On Efficient Data Deduplication Storage Technology Based On Intelligent Analysis Of Data Stream

Posted on: 2024-01-25
Degree: Master
Type: Thesis
Country: China
Candidate: L H Li
Full Text: PDF
GTID: 2558307127960889
Subject: Computer technology
Abstract/Summary:
Data deduplication is an efficient data compression technique that can save a great deal of storage space in large-scale data storage and backup. It has been a hot topic in the storage field in recent years and is widely used in data backup, archival storage, remote disaster recovery, and similar scenarios. Because a large proportion of stored data is duplicated, deduplication not only improves storage efficiency but also reduces traffic volume and improves network efficiency. In data center storage, current deduplication techniques face two major challenges. First, block-based deduplication achieves a higher deduplication rate than file-based deduplication, but it easily causes data fragmentation, which leads to read amplification and makes efficient data recovery difficult. Second, in the index-lookup phase, the large number of data blocks requires many disk I/O operations, which creates a disk-access bottleneck. To address these problems, this thesis carefully designs the layout of data block storage, optimizes the data block index structure, and proposes the following solutions:

1) A data rewriting algorithm based on cache prediction and submodular maximization (Cache Prediction Rewriting, CPR) is presented. First, the container selection problem is modeled as submodular maximization so that the containers holding the most distinct data blocks are selected, and the concept of a read amplification level is introduced to keep overly fragmented containers out of the result set. Then, the locality between data segments is analyzed, and through cache prediction the duplicate blocks in containers that are certain to be hit by the cache are deleted, so that fewer data fragments remain and the distribution of data blocks becomes more reasonable. Finally, experiments on simulated data streams from multiple data sets verify the effectiveness of the algorithm, showing that it improves both the deduplication rate and the data recovery speed.

2) On top of the rewriting algorithm, a container-level lightweight index is designed. First, in the on-disk component, a per-container list of the data blocks stored in that container is maintained, replacing the global on-disk block index. Then, a container-level lightweight index is kept in memory: its keys are the fingerprints of sample blocks randomly drawn from each data segment, and its values are pointers to the containers that hold those sample blocks. When a data segment is processed, some of its blocks are randomly sampled; the sampled blocks help locate the containers with the strongest locality and the most similar, densest data distribution, and only the corresponding data lists are loaded for fingerprint comparison and duplicate elimination. Finally, experiments show that the proposed method achieves higher throughput than state-of-the-art methods and mitigates the disk-access bottleneck.
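To make the container-selection step of 1) concrete, the following is a minimal Python sketch of a greedy coverage heuristic with a read-amplification cap. The function name, parameters, and the exact read-amplification formula are illustrative assumptions, not the thesis's CPR algorithm as specified.

```python
# Hedged sketch of greedy container selection with a read-amplification cap.
# Names and the budget/threshold parameters are illustrative assumptions.

def select_containers(segment_fingerprints, containers, budget, max_read_amp):
    """Greedily pick up to `budget` containers that cover the most distinct
    duplicate blocks of the incoming segment (a standard greedy heuristic for
    submodular coverage), skipping containers whose read amplification
    exceeds `max_read_amp`.

    segment_fingerprints: set of block fingerprints in the incoming segment
    containers: dict mapping container_id -> set of fingerprints it stores
    """
    uncovered = set(segment_fingerprints)
    selected = []
    for _ in range(budget):
        best_id, best_gain = None, 0
        for cid, fps in containers.items():
            if cid in selected:
                continue
            overlap = fps & uncovered
            if not overlap:
                continue
            # Read amplification: container blocks loaded per block reused.
            read_amp = len(fps) / len(overlap)
            if read_amp > max_read_amp:
                continue  # too fragmented; rewrite its duplicates instead
            if len(overlap) > best_gain:
                best_id, best_gain = cid, len(overlap)
        if best_id is None:
            break
        selected.append(best_id)
        uncovered -= containers[best_id]
    # Blocks left uncovered (or found only in overly fragmented containers)
    # would be rewritten into new containers rather than referenced.
    return selected, uncovered
```

The cap on read amplification is what excludes fragmented containers from the result set; duplicates that fall only in such containers are rewritten, which is the trade-off between deduplication rate and restore speed discussed above.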
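Similarly, the container-level lightweight index of 2) can be sketched as below. The sampling rate, the in-memory/on-disk split, and the top-k routing policy are assumptions made for illustration rather than the thesis's exact design.

```python
import random
from collections import defaultdict

# Hedged sketch of a container-level sampled-fingerprint index: only sampled
# fingerprints live in memory; full per-container lists stand in for the
# on-disk component. Parameters are illustrative assumptions.

class SampledContainerIndex:
    def __init__(self, sample_rate=1 / 64):
        self.sample_rate = sample_rate
        # In-memory index: sampled fingerprint -> container ids holding it.
        self.hooks = defaultdict(set)
        # Stand-in for the on-disk lists: container id -> full fingerprint set.
        self.container_lists = {}

    def insert_container(self, container_id, fingerprints):
        """Record a container's full list 'on disk' and its samples in memory."""
        self.container_lists[container_id] = set(fingerprints)
        for fp in fingerprints:
            if random.random() < self.sample_rate:
                self.hooks[fp].add(container_id)

    def deduplicate_segment(self, segment_fingerprints):
        """Route the segment to its most similar containers, then compare
        only against those containers' full lists."""
        votes = defaultdict(int)
        for fp in segment_fingerprints:
            for cid in self.hooks.get(fp, ()):
                votes[cid] += 1
        # Load only the best-matching container lists (here: top 2).
        candidates = sorted(votes, key=votes.get, reverse=True)[:2]
        known = set().union(*(self.container_lists[c] for c in candidates)) if candidates else set()
        duplicates = [fp for fp in segment_fingerprints if fp in known]
        new_blocks = [fp for fp in segment_fingerprints if fp not in known]
        return duplicates, new_blocks
```

Because only sampled fingerprints are kept in memory and only a few candidate container lists are read per segment, lookups avoid a global block index and most of the associated disk I/O, which is the intent of the design described above.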
Keywords/Search Tags: Data deduplication technology, Data fragmentation, Data rewriting, Memory index