With the advent of big data, data volumes in every field have grown rapidly and the limitations of traditional databases have been exposed, so many NoSQL technologies have flourished. Among them, HBase offers high scalability, high reliability and high performance, and it has attracted much attention in industry. Although many Internet companies use HBase extensively, it still has shortcomings; in particular, it does not exploit the characteristics of column data to compress that data efficiently. It is therefore of great significance to study efficient and practical compression algorithms that still allow query results to be returned quickly. Selecting a different algorithm for each column according to its characteristics is a classification problem. This thesis chooses a Bayesian classifier, which has a simple structure and high classification accuracy. Because the conditional independence assumption of the naive Bayesian classifier often does not hold in practice, this thesis presents a new algorithm for calculating the weighting coefficients of a naive Bayesian classifier. Each weighting coefficient is the average of a covariance-based weighting coefficient and an information-entropy-based weighting coefficient, so the improved algorithm considers not only the interaction between pairs of attributes but also the impact of a single attribute on the entire attribute set. The Base-128 Varints encoding of Protocol Buffers reduces the size of serialized data and can also be applied to data storage. Run-length encoding and dictionary encoding are suitable for data with high similarity. This thesis proposes improved versions of both that use Varints to encode the size of elements in run-length encoding and the integer values in dictionary encoding with integer indexes; experiments show that the improved algorithms raise the compression ratio in certain scenarios. In this thesis, HBase serves as the database for the classifier experiments, and seven algorithms are selected as the HBase
compression algorithm family: run-length encoding, improved run-length encoding, dictionary encoding, improved dictionary encoding, Gzip, LZ4 and Snappy. When HBase stores data, the classifier selects a suitable algorithm based on the characteristics of the data, and the data are then compressed with that algorithm and stored. This thesis therefore also examines how to integrate the classifier into HBase. Finally, the new feature-weighted Bayesian classifier is compared with the naive Bayesian classifier on the task of classifying HBase columns for compression, mainly in three respects: compression ratio, compression speed and decompression speed. The experimental results show that the improved Bayesian classifier outperforms the naive Bayesian classifier in selecting the compression algorithm, while its compression time and query time are almost the same as those of the naive Bayesian classifier, so the improved algorithm is feasible and applicable.
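
To make the idea of combining Varints with run-length encoding concrete, the following is a minimal Python sketch and not the implementation evaluated in this thesis. It assumes that the Varint-encoded quantities are the run count and the byte length of each element, and that a column arrives as a sequence of byte strings; the function names (encode_varint, decode_varint, rle_encode, rle_decode) are hypothetical and chosen only for illustration.

```python
from itertools import groupby
from typing import Iterable, List, Tuple


def encode_varint(n: int) -> bytes:
    """Base-128 Varint: 7 payload bits per byte, least-significant group
    first; the high bit is set on every byte except the last."""
    if n < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)         # final byte, high bit clear
            return bytes(out)


def decode_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one Varint starting at pos; return (value, next_position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def rle_encode(cells: Iterable[bytes]) -> bytes:
    """Run-length encode a column. Each run is stored as
    varint(run count) + varint(element size) + element bytes.
    This layout is an assumption made for illustration."""
    out = bytearray()
    for value, run in groupby(cells):
        count = sum(1 for _ in run)
        out += encode_varint(count)
        out += encode_varint(len(value))
        out += value
    return bytes(out)


def rle_decode(buf: bytes) -> List[bytes]:
    """Inverse of rle_encode."""
    cells: List[bytes] = []
    pos = 0
    while pos < len(buf):
        count, pos = decode_varint(buf, pos)
        size, pos = decode_varint(buf, pos)
        value = buf[pos:pos + size]
        pos += size
        cells.extend([value] * count)
    return cells
```

Under these assumptions, small counts and sizes occupy a single byte each instead of a fixed-width four-byte integer, which is where a gain in compression ratio on highly repetitive columns would be expected to come from; a run of 300 identical two-byte cells, for example, is stored in five bytes.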