
Research On Disaster Recovery Oriented Key Technology Of Lossless Data Compression

Posted on: 2011-09-05    Degree: Doctor    Type: Dissertation
Country: China    Candidate: W L Chang    Full Text: PDF
GTID: 1118330338489433    Subject: Information security
Abstract/Summary:
Data compression is the process of encoding information in fewer bits than an unencoded representation would require, through the use of specific encoding schemes. Compression is useful because it reduces the consumption of expensive resources such as hard disk space and transmission bandwidth. The specific application environment and technical requirements of data compression in disaster backup systems lead this dissertation to focus on the following issues: developing a Chinese corpus for the evaluation of lossless compression algorithms, exploring methods that improve the compression ratio of Chinese text, investigating block versions of the basic compression algorithms, extending the applications of data compression technology, and integrating data compression with cryptography.

Concerning the evaluation of lossless compression algorithms for Chinese-oriented applications: because data compression has been studied mainly in Europe and America, most authors have used the Calgary Corpus and the Canterbury Corpus to report experimental results for lossless compression algorithms, and these test sets are ASCII encoded. In the Chinese language environment, however, ANSI-coded data dominates, so the test sets need to be extended with Chinese-encoded samples. HITICT, a Chinese corpus for the evaluation of lossless compression algorithms based on ANSI code, is therefore proposed. Following the principles of application representativeness, complementarity, and openness, a large number of candidate files were collected from the Internet, and then the average compression ratio, average correlation coefficient, compression ratio correlation coefficient, and standard deviation were used to select the 10 files that give the most accurate indication of the overall performance of compression algorithms. Experimental results show that this collection is representative and stable, and can serve as a supplementary test set alongside the main benchmarks for comparing compression methods.

For Chinese text, we present a universal compression algorithm, CRecode, which is based on an accurate understanding of the properties of ANSI-coded Chinese data streams. CRecode highlights the importance of pre-processing for Chinese: it collects the Chinese characters, sorts them by frequency, and recodes them into 8-bit, 16-bit, or 24-bit codes. CRecode can act as a pre-processing stage for ANSI-coded Chinese data in all of the popular compression utilities, improving their compression ratios by 4% to 30%.

The mainstream lossless data compression algorithms have been studied extensively in recent years, but much less attention has been paid to their block variants, so this dissertation also investigates the block performance of the LZSS algorithm. We study the relationship between the compression ratio of block LZSS and the widths of the index (IA) and length (Len) fields. We find that the compression ratio improves as the block size increases. The width of the length (Len) field has little effect on compression performance, whereas the width of the index (IA) field has a significant effect on the compression ratio: the more bits allocated to the index field, the larger the optimal block size.
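To make the block-LZSS parameters above concrete, the sketch below shows a minimal greedy block-LZSS encoder in which the block size and the widths of the index (IA) and length (Len) fields appear as explicit parameters. This is an illustrative sketch only, not the dissertation's implementation; the parameter values, the brute-force matcher, and the token representation are assumptions.

```python
# Illustrative block-LZSS sketch (assumed parameters, not the dissertation's code).
INDEX_BITS = 12                              # width of the IA (index) field -> 4 KiB window
LEN_BITS = 4                                 # width of the Len field -> matches up to 18 bytes
MIN_MATCH = 3
MAX_MATCH = MIN_MATCH + (1 << LEN_BITS) - 1
WINDOW = 1 << INDEX_BITS

def lzss_encode_block(block: bytes):
    """Greedy LZSS over a single block; returns literal and (offset, length) tokens."""
    tokens, i = [], 0
    while i < len(block):
        best_len, best_off = 0, 0
        start = max(0, i - WINDOW)
        # Brute-force search of the sliding window for the longest match.
        for j in range(start, i):
            length = 0
            while (length < MAX_MATCH and i + length < len(block)
                   and block[j + length] == block[i + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, i - j
        if best_len >= MIN_MATCH:
            tokens.append(('ref', best_off, best_len))   # offset fits in IA bits, length in Len bits
            i += best_len
        else:
            tokens.append(('lit', block[i]))
            i += 1
    return tokens

def block_lzss(data: bytes, block_size: int = 64 * 1024):
    """Compress each fixed-size block independently, as in block LZSS."""
    return [lzss_encode_block(data[k:k + block_size])
            for k in range(0, len(data), block_size)]
```

In a real codec each ('ref', offset, length) token would be bit-packed into INDEX_BITS + LEN_BITS bits plus a flag bit, which is exactly where the field widths studied above affect the achievable compression ratio.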
In terms of time efficiency, as the size of the blocks being compressed grows, block compression becomes increasingly effective at reducing the overall compression time; this gain is due to disk activity.

Random number generators play a critical role in a number of important applications. In practice, the NIST statistical tests or the Diehard tests are used to gather evidence that a generator indeed produces numbers that appear random. We report on studies conducted on data compressed by 8 compressors, using an overall quality metric, SRV, to compare the randomness of the different compressed files. The test results suggest that the output of compression algorithms and compressors has poor randomness, so compression algorithms and compressors are not suitable as random number generators. We also find that, for the same compression algorithm, there is a positive correlation between compression ratio and randomness: increasing the compression ratio increases the randomness of the compressed data. Inspired by the LZSS compression algorithm and the RC4 stream cipher, a pseudo-random number generator (PRNG) is then presented and implemented. The results of the NIST and Diehard test suites indicate that it is a good PRNG; it appears to be sound and may be suitable for use in some cryptographic applications.

The mainstream compression algorithms, with the exception of static Huffman coding, are context-sensitive; that is, each bit of the compressed data depends on the context of the relevant statistical or dictionary information. Current mainstream compression tools combine several basic compression algorithms, which further strengthens the interdependence within the compressed data. Based on this characteristic of compressed data, the dissertation studies encryption methods for a variety of compression algorithms and identifies the smallest portion of the data that must be encrypted.
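As an illustration of the partial-encryption idea above (and of the RC4 component mentioned in the PRNG discussion), the hedged sketch below encrypts only a short prefix of a compressed stream with a standard RC4 keystream; since later bits depend on the dictionary or statistical context carried near the start of the stream, the remainder becomes hard to decode. The prefix length, the use of plain RC4, and the function names are assumptions and do not reproduce the dissertation's construction.

```python
# Hypothetical partial-encryption sketch: XOR only a small prefix of the
# compressed data with an RC4 keystream. RC4 is used here only for
# illustration; it is no longer recommended for new systems.

def rc4_keystream(key: bytes):
    """Standard RC4: key-scheduling algorithm followed by the keystream generator."""
    S = list(range(256))
    j = 0
    for i in range(256):                       # KSA
        j = (j + S[i] + key[i % len(key)]) % 256
        S[i], S[j] = S[j], S[i]
    i = j = 0
    while True:                                # PRGA
        i = (i + 1) % 256
        j = (j + S[i]) % 256
        S[i], S[j] = S[j], S[i]
        yield S[(S[i] + S[j]) % 256]

def encrypt_prefix(compressed: bytes, key: bytes, prefix_len: int = 256) -> bytes:
    """Encrypt only the first prefix_len bytes; the tail is left as-is but
    cannot be decompressed without the context hidden in the prefix."""
    ks = rc4_keystream(key)
    head = bytes(b ^ next(ks) for b in compressed[:prefix_len])
    return head + compressed[prefix_len:]
```

The prefix_len value shown is arbitrary; determining the genuinely minimal portion that must be encrypted for each compression algorithm is precisely the question the dissertation addresses.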
Keywords/Search Tags:loseless data compression, block compression, Huffman, LZSS, corpus, PRNG