Font Size: a A A

Design And Implementation Of Searchable Compression Algorithm And Its Application In ClamAV

Posted on:2019-01-17Degree:MasterType:Thesis
Country:ChinaCandidate:W X ZhangFull Text:PDF
GTID:2428330548459135Subject:Computer system architecture
Abstract/Summary:PDF Full Text Request
At present,due to the improvement of Internet technology and the rapid development of the Internet,data information has rapidly increased.With the increase of data volume,higher requirements have been placed on the storage,transmission,and processing of massive data.How a large amount of data can be searched quickly while reducing its space occupation has become a new problem that can be to exploring and needs to be solved.In this paper we propose a method that uses a new compression mode to process data.Different from the general compression mode processing method,the compression mode we proposed can reduce the size of the data file as well as support related operations such as searching directly in the compressed file,thereby achieving the dual purpose of searching and retrieving data quickly while reducing the occupation of data storage space.The compression algorithm proposed in this paper is a character string replacement method based on compression dictionary.The text processing is performed on the byte stream,so it is suitable for all file types.The main idea of the process is to exchange the byte pairs appearing in the text at high frequency(ie two consecutive bytes)with the single byte in the text with the low frequencies.When the frequency of the byte pairs we select in the text is higher than the frequency of the bytes we selected,compressing the text will produce a compression effect.For natural texts,compression is basically achieved.In addition,for byte pairs and bytes used for compression,in order to avoid ambiguity in the compression and decompression process,it is necessary to limit their selection,that is,there is no overlap between any two byte pairs selected.Because thenumber of byte pairs is much larger than the number of single bytes,the probability of it appearing high frequency is also lower than the probability of high frequency occurrence of a single byte,in order to ensure the maximum compression effect,according to the compression method The characteristics,different from the heuristic method used in Manber's compression mode introduced in this paper,this paper adopts the greedy method to select non-overlapping byte pairs.Since the maximum number of single bytes is 256,the maximum number of available byte pairs is 128.This compression mode based on the compression dictionary does not depend on the text context to ensure the searchability of the compressed file.When we search for compressed data,we first compress the data to be searched using the same compression method,then search for it in the compressed file,and decompress the searched results to get the final result.Experiments show that the compression effect of this algorithm can reach 15% on average.The algorithm is also applied to the grep search program and the ClamAV anti-virus system.According to experimental results,the search speeds of the two application modes can be increased by up to 81% and 15% respectively in the best case.In addition,we have integrated this method in ClamAV to facilitate its use.In order to better improve the compression ratio of the compressed method in large non-English texts,we chose the UTF-8 encoding format widely used in network transmission,and made a special algorithm improvement for this encoding format.In the medium and small-scale Chinese text compression experiments,the average compression rate can reach nearly 20%.In the grep finder,the UTF-8 format Chinese mode search speed can be increased by up to 70%.For large and highly duplicated files,it will have a higher compression rate.
Keywords/Search Tags:Algorithms, Data compression, Pattern matching, ClamAV
PDF Full Text Request
Related items