
Research On Semantic Compression Algorithms Of Massive Data Tables

Posted on: 2007-04-17
Degree: Master
Type: Thesis
Country: China
Candidate: J Feng
Full Text: PDF
GTID: 2178360212465568
Subject: Computer application technology
Abstract/Summary:
With the arrival of the information age, people face a huge and rapidly, continuously growing amount of information. Compressing data so that it can be stored efficiently is therefore increasingly important, both theoretically and practically.

There are two techniques for data compression. The first is statistics-based and reduces the numerical redundancy in the data. The second is semantics-based and reduces the content redundancy in the data.

From the point of view of compression, data can be classified as non-numerical or numerical. Effective exploratory analysis of massive, high-dimensional data tables, which are viewed as structured numerical data, is a ubiquitous requirement in a variety of application environments.

The earliest data compression methods were proposed for non-numerical data, such as text corpora and multimedia data. These methods, however, fail to provide adequate solutions for compressing structured numerical data, because they treat the table as one large byte string and do not account for the complex dependency patterns in the table. Semantic compression theory was proposed for structured numerical data. Semantic compression explores the semantic model of the data and reveals its implicit meaning and latent relationships, which are then applied in the compression algorithm. In general, semantic compression is a form of lossy compression that guarantees prescribed error bounds.

This thesis studies semantic compression algorithms for massive data tables. Existing semantic compression methods have shortcomings in adaptability and performance.
The thesis proposes a Bidirectional Semantic Compression (BSC) framework that takes advantage of data characteristics and data-mining models to perform lossy compression of massive data tables.

BSC integrates column-wise compression and row-wise compression, analyzes data characteristics such as linear correlation and time-series properties, and applies a different compression strategy in each case. If there is evident linear correlation among the attributes of the data table, BSC uses the PCA-Clustering compression algorithm. If there is no evident linear correlation among the attributes and the data do not have the time-series property, BSC uses the PMA-Clustering compression algorithm. If there is no evident linear correlation among the attributes and the data do have the time-series property, BSC uses the PMA-TS compression algorithm.

Extensive experiments were conducted, and the results indicate the superiority of BSC over previously known techniques. The compression methods described above reorganize the original data tables within the prescribed error bounds and produce a compression plan described in XML.
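
The strategy selection described above can be sketched as a small decision function. This is only an illustrative sketch, not the thesis's implementation: the abstract does not say how "evident linear correlation" is measured, so the pairwise Pearson test, the `corr_threshold` value of 0.8, the caller-supplied `is_time_series` flag, and the function names `pearson` and `choose_strategy` are all assumptions made for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric columns."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column gives no linear-correlation signal
    return cov / sqrt(var_x * var_y)


def choose_strategy(columns, is_time_series, corr_threshold=0.8):
    """Pick one of the three algorithm names given in the abstract.

    columns        -- list of numeric columns (attributes) of the table
    is_time_series -- assumed to be supplied by the caller (hypothetical flag)
    corr_threshold -- assumed cutoff for "evident" linear correlation
    """
    # Strongest absolute pairwise correlation over all attribute pairs.
    strongest = max(
        (abs(pearson(a, b)) for a, b in combinations(columns, 2)),
        default=0.0,
    )
    if strongest >= corr_threshold:
        return "PCA-Clustering"
    if is_time_series:
        return "PMA-TS"
    return "PMA-Clustering"


# Two perfectly correlated attributes select the PCA-based algorithm.
correlated = [[1, 2, 3, 4], [2, 4, 6, 8]]
# Two attributes with zero covariance fall through to the PMA branches.
uncorrelated = [[1, 2, 3, 4], [1, -1, -1, 1]]

print(choose_strategy(correlated, is_time_series=False))
print(choose_strategy(uncorrelated, is_time_series=False))
print(choose_strategy(uncorrelated, is_time_series=True))
```

The three calls exercise the three branches named in the abstract. A real system would presumably need a more robust correlation test and an automatic time-series check, neither of which is specified in this summary.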
Keywords/Search Tags: Semantic Compression, Predictive Model, Clustering Analysis, Compression Plan