
Research On Key Technology In Mass Data Processing Based On Inline Deduplication

Posted on: 2013-01-16    Degree: Doctor    Type: Dissertation
Country: China    Candidate: C Wang    Full Text: PDF
GTID: 1118330374486983    Subject: Information and Communication Engineering
Abstract/Summary:
With the rapid development of information technology and emerging network applications, the volume of electronic data is increasing sharply. As a result, the resources required to store and back up these massive data grow exponentially, and data centers are scaling toward the PB level, or even the EB level. Related research shows that, on average, 60% of the data in various application systems is duplicate, and the volume of duplicate data keeps growing over time. For example, a great deal of duplicate data exists in office automation systems, archival storage systems, group mail and attachments, Web pages, and software releases. In addition, periodic backups accelerate the growth of duplicate data even further. Repeatedly storing and transferring these duplicate data consumes a great deal of storage space and network bandwidth, which significantly increases the cost of data management. Consequently, improving resource utilization and reducing cost through deduplication has become a hot research topic in the area of backup and storage.

A mass backup system based on inline deduplication must not only improve data compression performance, but also ensure data security and high throughput. Therefore, the research in this dissertation focuses on the following points: improving the compression performance of duplicate data detection, improving the data recovery performance of the linear delta chain, improving the data security of the deduplication system, and improving the throughput of the deduplication system. The main innovative contributions of this dissertation include:

(1) A duplicate data detection method based on pre-chunking and a sliding window is proposed. By applying different chunking strategies to the changed and unchanged regions of the data, this method mitigates the conflict between improving compression performance and reducing metadata cost, breaking through the bottleneck that limits further compression gains. It achieves a satisfying compression ratio with a relatively large expected chunk size, and its time cost is much lower than that of current stateful detection methods. (A sketch of the underlying sliding-window chunking idea follows below.)

(2) A version transformation algorithm for delta files is proposed. The needed version file can be obtained without computing the intermediate version files, so the data recovery performance of the linear delta chain is improved notably while its optimal compression performance is retained. A delta backup system based on this algorithm achieves much higher compression than one based on a jumping version chain, and its data recovery time is much lower than that of the traditional recovery method. (See the delta-composition sketch below.)

(3) A deduplication-oriented data encryption method is proposed. This method uses the chunk as the basic unit of encryption, and the symmetric keys used to encrypt the chunks are generated by a convergent method. This eliminates the effects of inconsistent key choices and of the avalanche effect of the encryption algorithm. The method solves the problem that traditional encryption is incompatible with deduplication, so it can ensure the system's data confidentiality and compression performance simultaneously. (A convergent-encryption sketch follows below.)
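The abstract does not give the details of the pre-chunking and sliding-window scheme in contribution (1), but the baseline such methods build on is content-defined chunking: a rolling hash over the most recent bytes decides chunk boundaries from content rather than position, so a local edit only reshapes nearby chunks. Below is a minimal, hypothetical sketch using a Gear-style rolling hash; the GEAR table, mask, and size bounds are illustrative assumptions, not parameters from the dissertation.

```python
import random

# Gear table: 256 fixed random 64-bit values, one per byte value.
random.seed(1)
GEAR = [random.getrandbits(64) for _ in range(256)]

MASK = (1 << 13) - 1                 # boundary test -> ~8 KiB average chunks
MIN_CHUNK, MAX_CHUNK = 2048, 65536   # hard bounds on chunk size

def chunks(data: bytes):
    """Yield content-defined chunks of `data`.

    A boundary is declared wherever the low 13 bits of the rolling
    Gear hash are zero, so boundaries depend on content: an edit only
    reshapes the chunks around it instead of shifting every boundary.
    """
    start, h = 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF  # roll in next byte
        size = i - start + 1
        if (size >= MIN_CHUNK and (h & MASK) == 0) or size >= MAX_CHUNK:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):            # trailing partial chunk
        yield data[start:]
```

Each chunk would then be fingerprinted (e.g., with SHA-256) and looked up in the fingerprint index; only unseen fingerprints are stored.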
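Contribution (2) recovers a version in a linear delta chain without computing the intermediate versions. One way to realize that idea, sketched here purely as an illustration (the dissertation's actual algorithm is not described in the abstract), is to compose consecutive deltas directly: each copy instruction that references the middle version is rewritten into instructions against the base version.

```python
def apply_delta(base: bytes, delta) -> bytes:
    """Materialize a version by applying a delta to its base version."""
    out = bytearray()
    for op in delta:
        if op[0] == "insert":            # ("insert", literal_bytes)
            out += op[1]
        else:                            # ("copy", base_offset, length)
            _, off, n = op
            out += base[off:off + n]
    return bytes(out)

def compose(d1, d2):
    """Compose d1 (V0 -> V1) and d2 (V1 -> V2) into one delta V0 -> V2,
    never materializing the intermediate version V1."""
    # Record which V1 byte range each op of d1 produces.
    segs, pos = [], 0
    for op in d1:
        n = len(op[1]) if op[0] == "insert" else op[2]
        segs.append((pos, pos + n, op))
        pos += n

    out = []
    for op in d2:
        if op[0] == "insert":            # literals pass through unchanged
            out.append(op)
            continue
        _, off, n = op                   # a copy of V1[off:off+n]
        end = off + n
        for s, e, src in segs:           # linear scan for overlap; a real
            lo, hi = max(off, s), min(end, e)  # implementation would binary-search
            if lo >= hi:
                continue
            if src[0] == "insert":       # slice of d1's literal data
                out.append(("insert", src[1][lo - s:hi - s]))
            else:                        # copy within V0, offsets shifted
                out.append(("copy", src[1] + (lo - s), hi - lo))
    return out

# Demo: recover V2 directly from V0, without building V1.
v0 = b"hello world"
d1 = [("copy", 0, 5), ("insert", b" brave"), ("copy", 5, 6)]  # -> "hello brave world"
d2 = [("copy", 0, 11), ("insert", b"!")]                      # -> "hello brave!"
assert apply_delta(v0, compose(d1, d2)) == b"hello brave!"
```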
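The convergent keying of contribution (3) can be illustrated concretely: the key for each chunk is derived from the chunk's own content, so identical plaintext chunks always encrypt to identical ciphertexts and remain deduplicable. This is a minimal sketch using the third-party `cryptography` package; the choice of AES-GCM, the key-derived nonce, and the key-management details are assumptions for illustration, not the dissertation's exact construction.

```python
import hashlib
from cryptography.hazmat.primitives.ciphers.aead import AESGCM  # pip install cryptography

def convergent_encrypt(chunk: bytes):
    """Encrypt a chunk under a key derived from its own content.

    Because the key and nonce are deterministic functions of the chunk,
    identical plaintext chunks always produce identical ciphertexts,
    so the store can deduplicate them without seeing any plaintext.
    """
    key = hashlib.sha256(chunk).digest()        # content-derived 256-bit key
    nonce = hashlib.sha256(key).digest()[:12]   # deterministic nonce; safe only
                                                # because each key encrypts one
                                                # distinct plaintext
    ciphertext = AESGCM(key).encrypt(nonce, chunk, None)
    chunk_id = hashlib.sha256(ciphertext).hexdigest()  # dedup-index handle
    return chunk_id, key, ciphertext            # key is kept in the user's metadata

def convergent_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    nonce = hashlib.sha256(key).digest()[:12]
    return AESGCM(key).decrypt(nonce, ciphertext, None)
```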
(4) Two throughput-improving methods, which do not depend on the data locality of backup loads, are proposed. First, a throughput-improving method applicable to mixed backup loads is designed; on this basis, a second method suited to distributed application environments is proposed. These two methods remove the locality dependence of current throughput-improving methods and can process non-traditional backup loads effectively. Both achieve near-optimal compression performance and satisfying throughput.
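The abstract does not describe how the two methods in contribution (4) avoid the on-disk fingerprint-index bottleneck, so the sketch below shows only a generic ingredient such schemes commonly use: an in-memory Bloom filter in front of the chunk index, whose negative answers let new chunks skip the disk lookup entirely, regardless of how much locality the backup load has. All names and parameters here are illustrative assumptions.

```python
import hashlib

class BloomFilter:
    """In-memory front end for an on-disk fingerprint index.

    A negative answer is definitive, so most chunks of a new or
    non-traditional backup load can be declared unique without any
    disk I/O, independent of how much locality the load has.
    """
    def __init__(self, size_bits: int = 1 << 24, hashes: int = 4):
        self.size, self.k = size_bits, hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, fp: bytes):
        # Derive k bit positions from the chunk fingerprint.
        for i in range(self.k):
            h = hashlib.sha256(bytes([i]) + fp).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, fp: bytes) -> None:
        for p in self._positions(fp):
            self.bits[p // 8] |= 1 << (p % 8)

    def maybe_contains(self, fp: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(fp))

# Typical use during backup ingestion (store_chunk is hypothetical):
#   if not bloom.maybe_contains(fp):   # certainly new: store, no disk lookup
#       store_chunk(fp, chunk); bloom.add(fp)
#   else:                              # maybe a duplicate: consult on-disk index
#       ...
```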
Keywords/Search Tags: deduplication, mass data backup, data compression, data confidentiality, throughput