Research On Data Mining Algorithm Based On Compressed Database

Posted on: 2018-05-25    Degree: Master    Type: Thesis
Country: China    Candidate: Y Liu    Full Text: PDF
GTID: 2348330533969808    Subject: Computer technology
Abstract/Summary:
With social and economic development and the advance of science, a large amount of data has accumulated in all areas of society. Scientific and statistical databases hold many kinds of important data, such as scientific experiment results, geographical mapping, census data, and records of economic activity. These data are usually static: once entered into the database they are not changed and are kept permanently. As a result, such databases grow very large, querying, computing, and analyzing them with traditional database methods becomes very expensive, and the I/O transfer cost becomes unacceptable. Compressing massive databases has therefore become an important research direction. Many researchers have proposed algorithms for compressing databases, but there are few studies on data mining and analysis over compressed databases. The contribution of this paper is a set of methods for mining data efficiently on a compressed database. This paper mainly includes the following four aspects:

First, scientific and statistical databases are static, sparse, aggregated, and repetitive. Based on these characteristics, we propose a new database compression algorithm that compresses data records in blocks, and we analyze the algorithm theoretically. Experiments comparing it with other database compression algorithms show that the proposed algorithm achieves a high compression ratio on scientific and statistical databases.

Second, for association rule mining, this paper proposes the CApriori algorithm, which operates directly on the compressed database. The paper also gives a theoretical analysis of the improvement of CApriori over the Apriori algorithm. Experiments show that CApriori has better time performance than Apriori on compressed scientific and statistical databases.

Third, for cluster mining, this paper proposes the C-Kmeans algorithm, a variant of the Kmeans algorithm that clusters directly on a compressed database. Because the running time of Kmeans is linear in the number of data records, most of its running time is spent on I/O transfer. C-Kmeans operates on the compressed database directly and therefore saves a lot of time.

Fourth, frequent pattern mining on the vertical layout of a database involves many tidset intersection operations, which produce a large number of intermediate results and require frequent disk reads and writes. For this problem, this paper proposes the CONVTV compression algorithm. The algorithm uses two different formats to store vertical data records and achieves a high compression ratio on most data sets.
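To make the vertical layout and tidset intersection concrete, the following is a minimal Python sketch. It is not the CONVTV algorithm, whose two storage formats are not described in this abstract. It only illustrates the general technique: each item maps to the set of transaction ids (its tidset) that contain it, and the support of an item pair is the size of the intersection of the two tidsets. The function names `to_vertical` and `frequent_pairs` and the sample transactions are hypothetical, chosen for illustration only.

```python
from itertools import combinations


def to_vertical(transactions):
    """Map each item to its tidset: the ids of transactions containing it.

    Hypothetical helper for illustration; not taken from the thesis.
    """
    vertical = {}
    for tid, items in enumerate(transactions):
        for item in items:
            vertical.setdefault(item, set()).add(tid)
    return vertical


def frequent_pairs(vertical, min_support):
    """Return {(a, b): support} for item pairs meeting min_support.

    The support of a pair is the size of the intersection of the two
    tidsets. Each intersection is an intermediate result of the kind
    the abstract says must be read and written frequently.
    """
    result = {}
    for a, b in combinations(sorted(vertical), 2):
        tids = vertical[a] & vertical[b]  # tidset intersection
        if len(tids) >= min_support:
            result[(a, b)] = len(tids)
    return result


# Made-up sample data, for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]
vertical = to_vertical(transactions)
pairs = frequent_pairs(vertical, min_support=2)
```

In this sketch every candidate pair materializes its own intersected tidset, which suggests why intermediate results grow quickly on large databases and why a compact encoding of tidsets can matter.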
Keywords/Search Tags:Science and statistics, compressed database, association rule mining, cluster mining, vertical data layout