
Research On A File-level Data De-duplication Approach In Cloud Storage Systems

Posted on: 2020-06-14
Degree: Master
Type: Thesis
Country: China
Candidate: Q F Huang
Full Text: PDF
GTID: 2428330590451157
Subject: Software engineering
Abstract/Summary:
According to statistics, about 30 to 60 percent of the data in global cloud storage systems is duplicated, and the proportion reaches up to 70 percent for ordinary users. However, duplicate-data processing technology is mostly applied in the backup field, and research on detecting duplicate files before upload is rare. A well-designed online file de-duplication scheme would undoubtedly bring a great improvement to file system performance.

Targeting file-level duplicate detection at the file system layer of cloud storage systems, this paper adopts a de-duplication method based on a client-server division of labor, which comprises two parts: first, a file pre-screening method based on a Bloom filter is proposed; second, a PIA algorithm is proposed for incremental segmented digest calculation of files. Finally, based on these methods, this paper designs the complete de-duplication system.

First, when a file is to be uploaded, pre-screening is performed: after comparing the file's objective attributes against a Bloom filter and a partitioned table, files that definitely do not exist in the system are uploaded directly without participating in any subsequent calculation. Second, for files that may already exist in the system, the PIA algorithm performs a detailed segment-by-segment comparison; after the file is uploaded, the work left unfinished by the client is continued by the server. The core idea of the whole process is that the client filters layer by layer, so files that do not exist in the system are uploaded to the server directly and excluded from further client-side computation, which improves the resource utilization of the server and reduces the cost on the client.

Finally, experiments are carried out on the FastDFS distributed file system, comparing the PIA algorithm proposed in this paper with the full-file digest de-duplication of FastDHT. The experimental results show that the PIA algorithm can identify and process duplicate files quickly without reducing the de-duplication rate, greatly relieving the burden on computing resources. The data show that in the best case the algorithm filters out a non-duplicate file within 2 ms, with CPU occupancy of 0.39% and memory growth of no more than 0.1 GB, and in the worst case its cost is the same as that of the full-file digest algorithm.
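To illustrate the pre-screening step described above, the following is a minimal Python sketch, not the thesis's implementation: the choice of "objective attributes" (file name and size) and the double-hashing scheme are assumptions for illustration. The key property it demonstrates is that a negative answer from the Bloom filter is definitive, so such a file can be uploaded directly with no digest computation, while a positive answer only means the file may exist and must go on to the detailed comparison.

    import hashlib

    class BloomFilter:
        """Minimal Bloom filter: 'no' is definitive, 'yes' may be a false positive."""

        def __init__(self, num_bits=1 << 20, num_hashes=7):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8)

        def _positions(self, key: bytes):
            # Derive k bit positions from one SHA-256 via double hashing.
            h = hashlib.sha256(key).digest()
            h1 = int.from_bytes(h[:8], "big")
            h2 = int.from_bytes(h[8:16], "big") | 1
            for i in range(self.num_hashes):
                yield (h1 + i * h2) % self.num_bits

        def add(self, key: bytes):
            for p in self._positions(key):
                self.bits[p // 8] |= 1 << (p % 8)

        def might_contain(self, key: bytes) -> bool:
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

    def prescreen_key(name: str, size: int) -> bytes:
        # Hypothetical "objective attributes": file name and size.
        return f"{name}:{size}".encode()

    bf = BloomFilter()
    bf.add(prescreen_key("report.pdf", 104857600))
    if not bf.might_contain(prescreen_key("photo.jpg", 2048)):
        print("definitely new: upload directly, skip digest computation")
    else:
        print("possible duplicate: run segmented digest comparison")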
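The abstract does not spell out the PIA algorithm, but its stated behavior (rejecting a non-duplicate file after very little work in the best case, and matching the cost of a full-file digest in the worst case) is consistent with an early-exit segmented digest comparison. Below is a hedged Python sketch of that idea; the 4 MiB segment size, the MD5 per-segment hash, and the shape of the stored digest list are all assumptions, not details from the thesis.

    import hashlib
    from typing import Iterator

    SEGMENT_SIZE = 4 * 1024 * 1024  # assumed 4 MiB segments

    def segment_digests(path: str) -> Iterator[str]:
        """Yield one digest per fixed-size segment, computed incrementally."""
        with open(path, "rb") as f:
            while True:
                chunk = f.read(SEGMENT_SIZE)
                if not chunk:
                    break
                yield hashlib.md5(chunk).hexdigest()

    def is_duplicate(path: str, stored_digests: list[str]) -> bool:
        """Compare segment digests one by one, stopping at the first mismatch.

        Best case: a non-duplicate file is rejected after its first segment,
        so almost no CPU time is spent. Worst case (a true duplicate): every
        segment is hashed, the same total cost as a full-file digest.
        """
        count = 0
        for i, digest in enumerate(segment_digests(path)):
            if i >= len(stored_digests) or digest != stored_digests[i]:
                return False  # early exit: file differs, upload it
            count += 1
        return count == len(stored_digests)  # duplicate only if lengths match too

Under this reading, the division of labor in the thesis would have the client compute and compare the leading segments, and the server continue from wherever the client stopped, which matches the abstract's description of the server taking over the client's unfinished work.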
Keywords/Search Tags: File De-duplication, FastDFS Distributed File System, Bloom Filter, Digest Algorithm