Study On Distributed Compression Storage Optimization Based On RCFile Storage Model

Posted on: 2018-08-01
Degree: Master
Type: Thesis
Country: China
Candidate: H Y He
GTID: 2348330536479633
Subject: Computer system architecture
Abstract/Summary:
With the development of cloud computing, the IoT, and social networks, a staggering amount of data is now being generated. To process this ever-growing volume, data compression is essential for reducing storage volume and improving storage efficiency. RCFile is a representative record-columnar storage model, and it faces two challenges regarding data compressibility. First, the data on one storage node often comes from different clients, so adjacent records on the same node differ considerably; once the records are converted to the RCFile storage format, the similarity of the data within one row of a RowGroup is low (in RCFile, each stored row of a RowGroup holds the values of one column). Second, RCFile currently uses a single compression algorithm, Gzip, to compress RowGroups, ignoring the data types and distribution characteristics of the different rows of a RowGroup. To address these problems, the thesis makes the following contributions.

First, a Pre-Processing Distribution Model (PPDM) is designed to handle data from different clients. PPDM first defines a set of standard data vectors that uniformly divide the data space into several subspaces of similar data, each of which is matched to one storage node. On this basis, a Pre-Processing Distribution Algorithm determines which storage node each piece of client data belongs to. Experimental results show that the proposed method effectively improves the data compression ratio in RCFile as the division of the data space is refined.

Second, an Adaptive Compression Strategy Based on Compression Cost (ACSCC) is proposed to overcome the drawbacks of compressing RowGroups with Gzip alone. ACSCC first defines a compression cost to evaluate the performance of different compression algorithms. The recommended compression algorithm is then obtained by calculating the similarity between the sample data of the current row and reference field data. To keep ACSCC effective, the compression algorithm is reselected for the next set of data to be compressed by comparing the difference between the compression ratio of the current row and the mean compression ratio of the preceding data against a compression ratio threshold. Experimental results on the TPC-H benchmark data set show that ACSCC effectively improves compression performance.
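The PPDM routing step can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the thesis's algorithm: the abstract does not say how a record is turned into a vector or which distance is used, so the 2-D feature vectors, the Euclidean nearest-vector rule, and the names `nearest_node` and `distribute` are all hypothetical.

```python
import math


def nearest_node(record_vector, standard_vectors):
    """Return the index of the standard vector closest to record_vector.

    Assumption: "closest" means smallest Euclidean distance.
    """
    return min(
        range(len(standard_vectors)),
        key=lambda i: math.dist(record_vector, standard_vectors[i]),
    )


def distribute(records, standard_vectors):
    """Group record vectors by the storage node whose standard vector is nearest."""
    nodes = {i: [] for i in range(len(standard_vectors))}
    for vec in records:
        nodes[nearest_node(vec, standard_vectors)].append(vec)
    return nodes


# Hypothetical setup: four standard vectors split the unit square into
# four equal cells, one cell per storage node.
standard_vectors = [(0.25, 0.25), (0.25, 0.75), (0.75, 0.25), (0.75, 0.75)]
records = [(0.1, 0.2), (0.9, 0.8), (0.3, 0.9), (0.8, 0.1), (0.2, 0.3)]
placement = distribute(records, standard_vectors)
```

Records that land in the same cell end up on the same node, which is the property the abstract relies on: similar records are stored together before conversion to the RCFile format.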
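The reselection loop of ACSCC can likewise be sketched in Python with the standard-library codecs `zlib`, `bz2`, and `lzma`. Several parts are assumptions, because the abstract does not define them: the cost formula (compressed-to-original size ratio plus a weighted time term), the threshold value, and the names `compression_cost`, `pick_codec`, and `compress_blocks` are hypothetical. The sketch also replaces the thesis's similarity lookup against reference field data with a direct cost measurement on a sample, and takes "compression ratio" to mean compressed size divided by original size.

```python
import bz2
import lzma
import random
import time
import zlib

# Candidate algorithms; the set actually used in the thesis is not stated.
CODECS = {"zlib": zlib.compress, "bz2": bz2.compress, "lzma": lzma.compress}


def compression_cost(data, compress, time_weight=0.0):
    """Hypothetical cost: size ratio plus time_weight times seconds spent."""
    start = time.perf_counter()
    compressed = compress(data)
    elapsed = time.perf_counter() - start
    return len(compressed) / len(data) + time_weight * elapsed


def pick_codec(sample, time_weight=0.0):
    """Return the name of the codec with the lowest cost on the sample."""
    return min(
        CODECS,
        key=lambda name: compression_cost(sample, CODECS[name], time_weight),
    )


def compress_blocks(blocks, threshold=0.1, time_weight=0.0):
    """Compress non-empty blocks in order; return the codec used for each.

    When a block's ratio differs from the mean ratio of the preceding
    blocks by more than threshold, the codec is reselected for the next block.
    """
    codec = pick_codec(blocks[0], time_weight)
    ratios, used = [], []
    for block in blocks:
        ratio = len(CODECS[codec](block)) / len(block)
        used.append(codec)
        if ratios and abs(ratio - sum(ratios) / len(ratios)) > threshold:
            codec = pick_codec(block, time_weight)  # applies from the next block
            ratios = []  # restart the running mean for the new codec
        else:
            ratios.append(ratio)
    return used


# Illustrative data: repetitive text, then a block of pseudo-random bytes.
text = b"customer,order,lineitem;" * 100
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(1500))
used = compress_blocks([text, text, noise, text])
```

With `time_weight=0.0` the cost depends only on compressed size, so the selection is deterministic; a positive weight would trade compression ratio against speed, which is one plausible reading of "compression cost".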
Keywords/Search Tags: Distributed Compression and Storage, RCFile, Preprocessing, Compression Cost, Compression Strategy