With the rapid development of digital information technology and the exponential growth of global data volume, data centers, which contain a great deal of redundant data, have reached the PB or even EB scale. Because this redundant data occupies a large share of storage resources, it degrades storage-system performance and raises the cost of data storage and management. Against this background, Storage Capacity Reduction Technology (SCRT), which has demonstrated great academic and practical value, uses data deduplication and data compression to reduce the data scale and the cost of management and to improve storage utilization, without changing the basic attributes of the data.

Big data has the following characteristics: large scale, varied types, heavy redundancy, and high demands on processing speed. These characteristics confront SCRT with several technical problems in big data applications: the high time cost of chunking, the time needed to discover redundant chunks, and the limits on compression speed and ratio. To solve these problems, this dissertation studies SCRT for big data. Specifically, its contributions are threefold.

1. Bit-string Content-aware Chunking Strategy (BCCS). This dissertation analyzes the factors that affect chunking performance and realizes a new digital-signature technique based on bit strings. By extracting one designated bit from every text byte, BCCS forms the characteristic data of its sliding window. Because traditional byte comparisons are replaced by bit operations and every matching step is optimized, BCCS achieves the largest possible jump length, speeds up bit-string matching, and reduces the CPU resources consumed in locating chunk boundaries. Results show that BCCS is up to 197% faster than Rabin-based chunking on the unimmobilized data set; on the immobilized data set it is 10.8% slower than FSP, while the compression ratio improves by roughly 20%, from 0.977 to 1.206. (A minimal sketch of the bit-string boundary test follows the summary below.)

2. Redundant Chunk Query Mechanism based on a Two-staged Bloom Filter (RCQM-TBF). The full set of chunk fingerprints (FPs) cannot be held entirely in memory, which degrades query performance; RCQM-TBF is proposed to address this. The second-stage Bloom filter refines the first: each of its bits represents the FPs that share the same first-stage false-positive class. For accesses that would be false positives, TBF quickly judges that the chunk does not exist by improving the combined false-positive behavior of the two stages. For normal FP accesses, it quickly judges whether a newly arrived chunk already exists by building an FP cache linked list and a corresponding FP prefetching mechanism, which directly reduces hard-disk accesses. It also constructs a hash function that lowers the probability of collisions. Experiments show that FP query latency and storage performance improve by up to 28% over the standard Bloom-filter scheme, ZHU-Bloom Filter, on the Unique Data Set (UDS), and that storage speed increases by 100% to 135% over ZHU-Bloom Filter on the Dedup Data Set (DDS). In theory, the mechanism can manage up to 64 PB of storage when server memory is expanded. (A sketch of the two-stage negative lookup follows the summary below.)
3. Parallel Matching LZSS based on Multiple Matrices (PMLZSS-MM). To improve the utilization of storage capacity and speed up the compression of big data, this thesis proposes PMLZSS-MM. It introduces a parallel matrix-matching model on the GPU: the data object to be compressed is dynamically divided into paired dictionary strings and pre-read strings, which form the vertical and horizontal axes of matching matrices; the matrices are distributed across different GPU thread blocks, yielding parallel matching over multiple matrices. The generation of the compression coding, which calls for serial execution, remains on the CPU, and a reasonable scheduling strategy coordinates the GPU and CPU to complete the task. The results show that, compared with the classic serial LZSS algorithm, the storage reduction rate of PMLZSS-MM decreases by about 1.5%, but PMLZSS-MM clearly improves compression speed on big data: its compression throughput is up to 18 times that of serial LZSS on the CPU platform, and up to 20.8% higher than that of the parallel CULZSS on the GPU platform, with the compression dictionary window set to 4 KB and the pre-read window to 64 B. (A serial model of the matching matrix follows the summary below.)

In summary, this thesis puts forward three findings. First, the proposed BCCS effectively reduces CPU resource consumption during chunking and speeds up the search for chunk boundaries. Second, RCQM-TBF distinctly improves fingerprint query speed and query efficiency. Last but not least, PMLZSS-MM supplements and optimizes the first two findings, helping to obtain a higher storage-capacity reduction.
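The following is a minimal C sketch of the bit-string boundary test described in contribution 1. The abstract does not specify which bit is extracted, the signature width, or the boundary pattern, so the values below (bit 3 of each byte, a 13-bit window signature, the pattern 0x1A5B, and the min/max chunk sizes) are illustrative assumptions, the function names are hypothetical, and the jump-length optimization is omitted.

```c
/* Sketch of a bit-string content-defined chunker: one bit per input byte
 * is shifted into a window signature, and a boundary is declared when the
 * signature equals a fixed pattern. Only bit operations are used on the
 * hot path; no byte-string comparison is performed. */
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

#define WIN_BITS  13                           /* assumed signature width */
#define PATTERN   0x1A5Bu                      /* assumed boundary pattern */
#define SIG_MASK  ((1u << WIN_BITS) - 1u)
#define MIN_CHUNK 2048u                        /* assumed size bounds */
#define MAX_CHUNK 65536u

/* Return the length of the next chunk starting at data[0]. */
static size_t next_chunk(const uint8_t *data, size_t len) {
    uint32_t sig = 0;
    for (size_t i = 0; i < len; i++) {
        /* shift in bit 3 of the current byte: pure bit operations */
        sig = ((sig << 1) | ((data[i] >> 3) & 1u)) & SIG_MASK;
        if (i + 1 >= MIN_CHUNK && sig == PATTERN)
            return i + 1;                      /* content-defined boundary */
        if (i + 1 >= MAX_CHUNK)
            return i + 1;                      /* forced boundary */
    }
    return len;                                /* tail chunk */
}

int main(void) {
    static uint8_t buf[1u << 20];
    /* fill with a simple pseudo-random byte stream for demonstration */
    uint32_t x = 123456789u;
    for (size_t i = 0; i < sizeof buf; i++) {
        x = x * 1664525u + 1013904223u;
        buf[i] = (uint8_t)(x >> 24);
    }
    size_t off = 0, chunks = 0;
    while (off < sizeof buf) {
        off += next_chunk(buf + off, sizeof buf - off);
        chunks++;
    }
    printf("%zu chunks, mean chunk size %zu bytes\n",
           chunks, sizeof buf / chunks);
    return 0;
}
```

With a 13-bit signature over roughly uniform input, a boundary fires about once every 8192 bytes past the minimum, giving an expected chunk size near 10 KB; the real BCCS parameters would be tuned to the target chunk size.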
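Next is a sketch of the two-stage negative lookup from contribution 2: a miss in either Bloom-filter stage proves the fingerprint is new, so the disk-resident FP index is consulted only when both stages hit. The filter sizes and hash functions below are placeholders, and the FP cache linked list and prefetching machinery of RCQM-TBF are not modelled.

```c
/* Two-stage Bloom-filter fast path for fingerprint (FP) queries, assuming
 * a small stage-1 filter and a larger stage-2 filter, both in memory. */
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

#define M1 (1u << 16)   /* stage-1 size in bits (assumed) */
#define M2 (1u << 20)   /* stage-2 size in bits (assumed) */
#define K  3            /* hash functions per stage (assumed) */

static uint8_t bf1[M1 / 8], bf2[M2 / 8];

/* FNV-1a with a per-function seed, reduced modulo the filter size. */
static uint32_t hash(const uint8_t *fp, size_t len, uint32_t seed, uint32_t m) {
    uint32_t h = 2166136261u ^ seed;
    for (size_t i = 0; i < len; i++) { h ^= fp[i]; h *= 16777619u; }
    return h % m;
}

static void bf_set(uint8_t *bf, uint32_t bit) { bf[bit >> 3] |= (uint8_t)(1u << (bit & 7)); }
static bool bf_get(const uint8_t *bf, uint32_t bit) { return (bf[bit >> 3] >> (bit & 7)) & 1u; }

/* Record a stored fingerprint in both stages. */
static void insert_fp(const uint8_t *fp, size_t len) {
    for (uint32_t k = 0; k < K; k++) {
        bf_set(bf1, hash(fp, len, k, M1));
        bf_set(bf2, hash(fp, len, k + K, M2));
    }
}

/* Returns true if the fingerprint is definitely new: no disk access needed. */
static bool definitely_new(const uint8_t *fp, size_t len) {
    for (uint32_t k = 0; k < K; k++)
        if (!bf_get(bf1, hash(fp, len, k, M1)))
            return true;               /* stage-1 miss: provably absent */
    for (uint32_t k = 0; k < K; k++)
        if (!bf_get(bf2, hash(fp, len, k + K, M2)))
            return true;               /* stage-2 miss: provably absent */
    return false;                      /* possible duplicate: consult FP index */
}

int main(void) {
    const uint8_t a[] = "fingerprint-A", b[] = "fingerprint-B";
    insert_fp(a, sizeof a);
    printf("A definitely new? %d\n", definitely_new(a, sizeof a)); /* 0 */
    printf("B definitely new? %d\n", definitely_new(b, sizeof b)); /* almost surely 1 */
    return 0;
}
```

The design point this illustrates is that for never-seen chunks, which dominate a first backup, the query usually terminates in memory; only the residual double-hit cases pay for the cache lookup and possible disk access.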
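Finally, a serial C model of the matching matrix in contribution 3. In PMLZSS-MM the match search runs on the GPU, with candidate dictionary offsets spread across threads and the matrices split across thread blocks; the outer loop below stands in for that parallel step, and only the window sizes (4 KB dictionary, 64 B pre-read) follow the reported experiment. The names and structure are illustrative, not the dissertation's implementation.

```c
/* Serial model of LZSS matrix matching: the dictionary window forms one
 * axis and the pre-read (look-ahead) window the other; each cell is the
 * match length at one dictionary offset. On the GPU, each offset would be
 * one thread and a reduction would pick the per-block best match. */
#include <stdio.h>
#include <string.h>
#include <stddef.h>

#define DICT_WIN  4096  /* dictionary window, as in the reported experiment */
#define LOOK_WIN  64    /* pre-read (look-ahead) window */
#define MIN_MATCH 3     /* emit an (offset, length) token only at/above this */

/* Find the longest match of the look-ahead head inside the dictionary. */
static size_t longest_match(const unsigned char *dict, size_t dlen,
                            const unsigned char *look, size_t llen,
                            size_t *best_off) {
    size_t best = 0;
    *best_off = 0;
    for (size_t i = 0; i < dlen; i++) {        /* one GPU thread per offset */
        size_t j = 0;
        while (j < llen && i + j < dlen && dict[i + j] == look[j])
            j++;
        if (j > best) { best = j; *best_off = i; }
    }
    return best;
}

int main(void) {
    const unsigned char *text =
        (const unsigned char *)"abracadabra abracadabra";
    size_t pos = 12;                            /* bytes already encoded */
    size_t dlen = pos < DICT_WIN ? pos : DICT_WIN;
    size_t llen = strlen((const char *)text) - pos;
    if (llen > LOOK_WIN) llen = LOOK_WIN;

    size_t off, len = longest_match(text + pos - dlen, dlen,
                                    text + pos, llen, &off);
    if (len >= MIN_MATCH)                       /* serial coding step (CPU) */
        printf("match: %zu bytes back, length %zu\n", dlen - off, len);
    else
        printf("literal: %c\n", text[pos]);
    return 0;
}
```

Keeping the token generation serial on the CPU, as the abstract states, avoids the ordering hazards of emitting a variable-length code stream from many GPU threads, while the embarrassingly parallel match search absorbs the bulk of the work.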