Research On Parallel Data Redundancy Elimination Strategy In Cloud Storage With MPI And Four-stage Pipelines

Posted on: 2021-05-01  Degree: Master  Type: Thesis
Country: China  Candidate: B S Zhu  Full Text: PDF
GTID: 2428330605972992  Subject: Computer Science and Technology
Abstract/Summary:
Cloud storage is an effective and critical technology for data storage in the era of big data. However, because many files and data blocks are stored on cloud servers, server resources are wasted when the same file or data block is stored repeatedly, so data redundancy elimination (deduplication) should be applied. Deduplication currently takes too long and does not fully utilize cloud server resources, so its processing speed needs to be increased and the resource utilization of the cloud server improved. This thesis proposes a parallel data redundancy elimination strategy for cloud storage based on the Message Passing Interface (MPI) and four-stage pipelines. On the client, a four-stage pipeline parallel partitioning strategy accelerates data partitioning (file reading, data partitioning, data compression, and fingerprint calculation). On the cloud server, after the master node receives the data blocks and block metadata, it distributes the block metadata evenly to the slave nodes through MPI, and the slave nodes perform deduplication against the global Bloom filter matrix. When a false positive occurs because of a hash collision, redundancy elimination is carried out on the secondary index structure. Redundancy elimination thus runs in parallel on multiple slave nodes, and each non-redundant data block is stored on the corresponding slave node, which completes redundancy elimination in the cloud storage environment.
The experiments consist of three parts. First, the client partitions files with the four-stage pipeline parallel partitioning strategy (file reading, data partitioning, data compression, and fingerprint calculation). Second, virtual machines and MPI are used to build a multi-node parallel computing environment on the cloud server side, running CentOS 7, with one master node and the remaining nodes as slaves; the redundancy elimination time and the total processing time are measured with 4, 8, 16, 24, and 32 slave nodes for files of 2.19 MB and 300 MB. Third, the improvement in indexing performance from the secondary index structure used by the cloud servers is verified for data retrieval. The results show that, on the client side, the four-stage pipeline parallel partitioning strategy greatly improves partitioning speed compared with a single-core processor. On the cloud server, the MPI-based parallel strategy greatly reduces redundancy elimination time compared with sending each data block to an arbitrary slave node for deduplication, and the reduction becomes more pronounced as the number of data blocks grows. Compared with a linked hash index structure, the secondary index structure reduces data retrieval time and thereby improves indexing performance, and the improvement grows with file size.
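The client-side pipeline and the Bloom-filter check described in the abstract can be illustrated with a minimal single-machine sketch. It is not the thesis implementation and rests on assumptions the abstract does not state: fixed-size 4 KiB blocks, zlib compression, SHA-1 fingerprints taken over the compressed block, one thread per stage, a single in-process Bloom filter in place of the global Bloom filter matrix, and a Python set standing in for the secondary index structure. MPI distribution to slave nodes is omitted. The names pipeline and BloomDedup are hypothetical.

```python
import hashlib
import io
import queue
import threading
import zlib

CHUNK_SIZE = 4096   # assumption: fixed-size blocks
_DONE = object()    # sentinel that marks the end of the stream


def _read(stream, outbox, read_size):
    """Stage 1: read the input in buffers of read_size bytes."""
    while True:
        buf = stream.read(read_size)
        if not buf:
            outbox.put(_DONE)
            return
        outbox.put(buf)


def _run_stage(func, inbox, outbox):
    """Stages 2-4: apply func to each item; func returns a list of outputs."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)
            return
        for out in func(item):
            outbox.put(out)


def _partition(buf):
    """Stage 2: split a buffer into fixed-size blocks."""
    return [buf[i:i + CHUNK_SIZE] for i in range(0, len(buf), CHUNK_SIZE)]


def _compress(block):
    """Stage 3: compress one block."""
    return [zlib.compress(block)]


def _fingerprint(block):
    """Stage 4: pair the compressed block with its SHA-1 fingerprint."""
    return [(hashlib.sha1(block).hexdigest(), block)]


def pipeline(stream, read_size=4 * CHUNK_SIZE):
    """Return (fingerprint, compressed_block) pairs in stream order."""
    queues = [queue.Queue(maxsize=64) for _ in range(4)]
    threads = [threading.Thread(target=_read, args=(stream, queues[0], read_size))]
    for i, func in enumerate((_partition, _compress, _fingerprint)):
        threads.append(threading.Thread(
            target=_run_stage, args=(func, queues[i], queues[i + 1])))
    for t in threads:
        t.start()
    results = []
    while True:
        item = queues[3].get()   # the caller drains the last queue
        if item is _DONE:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


class BloomDedup:
    """Bloom filter in front of an exact index (a plain set in this sketch)."""

    def __init__(self, bits=1 << 16, hashes=3):
        self.bits = bits
        self.hashes = hashes          # at most 8 with a 32-byte digest
        self.array = bytearray(bits // 8)
        self.index = set()            # stand-in for the secondary index

    def _positions(self, fp):
        digest = hashlib.sha256(fp.encode()).digest()
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.bits
                for i in range(self.hashes)]

    def is_new(self, fp):
        """Return True if fp has not been stored before, and record it."""
        pos = self._positions(fp)
        maybe_seen = all(self.array[p // 8] & (1 << (p % 8)) for p in pos)
        if maybe_seen and fp in self.index:
            return False              # duplicate confirmed by the exact index
        for p in pos:                 # new block, or a Bloom false positive
            self.array[p // 8] |= 1 << (p % 8)
        self.index.add(fp)
        return True


if __name__ == "__main__":
    data = b"A" * 8192 + b"B" * 4096
    dedup = BloomDedup()
    for fp, block in pipeline(io.BytesIO(data)):
        print(fp[:12], len(block), "store" if dedup.is_new(fp) else "skip")
```

In this sketch the exact index is consulted only when the Bloom filter reports a possible match, which mirrors the fallback to the secondary index on a false positive. Each stage runs in its own thread and hands work to the next through a bounded queue, so the four stages can overlap on different blocks while block order is preserved.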
Keywords/Search Tags: MPI, four-stage pipelines, cloud storage, data redundancy elimination, parallel computing