Lossless Compression Techniques For Similar Data

Posted on: 2012-06-22
Degree: Master
Type: Thesis
Country: China
Candidate: G Y Zhao
Full Text: PDF
GTID: 2248330395458250
Subject: Computer software and theory
Abstract/Summary:
With the advent of the information age, the amount of information is increasing explosively. Much of this data, such as network information, observation data, and biological information, is highly similar. Traditional data compression methods, however, do not achieve good compression ratios on such data, so effective lossless compression of similar data is of great significance.

Recently, much research has gone into compression methods that represent highly similar biological information as a base sequence plus variants. Because the data are so similar, a large biological sequence can be represented by only a few variants. Meanwhile, in the database area, some scholars have used this kind of compression to handle high-dimensional relations based on semantic information. The main purpose of this thesis is to solve the problem of efficient compression and decompression using a base sequence and variants.

To begin with, we summarize the framework of lossless compression based on variants. Through an analysis of these algorithms, we propose a lossless compression method for general similar data. We use a base sequence and a series of variants to express the entire dataset, based on edit distance and Smith-Waterman similarity. Because real data usually lack overall similarity but contain blocks with clearly similar characteristics, we propose clustering the data first and then compressing it. We also find a compromise between the number of clusters and the similarity within each cluster to optimize compression. Drawing on sequence alignment algorithms, we solve the problem of constructing the real cluster center of a set of sequences. After optimizing the expression of variants, we also give an efficient decompression algorithm.

Finally, extensive experiments and analysis on real data sets show that the proposed lossless compression technique achieves a good compression ratio on similar sequence data.
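As a rough illustration of the base-sequence-plus-variants idea, the following minimal Python sketch stores a sequence as a list of edits against a base sequence and then reconstructs it exactly. It is not the thesis's algorithm: the alignment here comes from the standard library's `difflib.SequenceMatcher` as a stand-in for the edit-distance and Smith-Waterman alignment the abstract mentions, and the function names `encode` and `decode`, the `(start, end, replacement)` variant format, and the sample strings are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def encode(base, target):
    """Describe target as a list of variants relative to base.

    Each variant is (start, end, replacement): replace base[start:end]
    with replacement. This tuple format is an assumption of this sketch.
    """
    variants = []
    # autojunk=False disables difflib's heuristic that ignores very
    # frequent characters, which matters for small alphabets like DNA.
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Covers 'replace', 'delete' (empty replacement)
            # and 'insert' (start == end).
            variants.append((i1, i2, target[j1:j2]))
    return variants


def decode(base, variants):
    """Rebuild the original sequence from base and its variants."""
    pieces = []
    pos = 0
    for start, end, replacement in variants:
        pieces.append(base[pos:start])  # unchanged stretch of base
        pieces.append(replacement)      # the differing content
        pos = end
    pieces.append(base[pos:])           # unchanged tail of base
    return "".join(pieces)


# Hypothetical sample data: a base sequence and a similar sequence.
base = "ACGTACGTACGT"
target = "ACGTTCGTACGGT"
variants = encode(base, target)
assert decode(base, variants) == target  # the round trip is lossless
```

When the two sequences are highly similar, the variant list is short compared with the target itself, which is where the compression comes from. For dissimilar sequences the list can grow to the size of the target, which is consistent with the abstract's motivation for clustering the data before compressing it.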
Keywords/Search Tags: lossless compression, variant expression, edit distance, cluster, string sequence