Research On Data Mining Algorithm Based On Compressed Database

Posted on: 2018-05-25

Degree: Master

Type: Thesis

Country: China

Candidate: Y Liu

Full Text: PDF

GTID: 2348330533969808

Subject: Computer technology

Abstract/Summary:

With economic growth and scientific progress, large amounts of data have accumulated in every part of society. Scientific and statistical databases hold many kinds of important data, such as experimental results, geographical surveys, census records, and records of economic activity. These data are usually static: once entered into the database, they are not changed and are kept permanently. Such databases therefore grow very large, and the I/O cost of querying, computing, and analyzing them with traditional database methods becomes unacceptable. Compressing massive databases has thus become an important research direction. Many researchers have proposed database compression algorithms, but few studies address data mining and analysis on compressed databases. The contribution of this thesis is a set of methods for mining data efficiently on compressed databases. It covers the following four aspects.

Scientific and statistical databases are static, sparse, aggregated, and repetitive. Based on these characteristics, we propose a new database compression algorithm that compresses data records in blocks, and we analyze the algorithm theoretically. An experimental comparison with other database compression algorithms shows that the proposed algorithm achieves a high compression ratio on scientific and statistical databases.

For association rule mining, we propose the CApriori algorithm, which operates directly on the compressed database. We analyze theoretically the improvement of CApriori over Apriori, and experiments show that CApriori has better time performance than Apriori on compressed scientific and statistical databases.

For clustering, we propose the C-Kmeans algorithm, a variant of the Kmeans algorithm that operates directly on a compressed database. Because the running time of Kmeans is linear in the number of data records, most of its running time is spent on I/O transmission. By operating on the compressed database directly, C-Kmeans saves much of this time.

Frequent pattern mining on a vertical database layout involves many tidset intersection operations, which produce a large number of intermediate results and require frequent disk reads and writes. To address this problem, we propose the CONVTV compression algorithm, which stores vertical data records in two different formats and achieves a high compression ratio on most data sets.
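
The tidset intersection step mentioned for the vertical layout can be illustrated with a minimal Python sketch. This is not the CONVTV algorithm, whose two storage formats the abstract does not describe. It only shows the general technique of computing support by intersecting transaction-id sets. The function names (`to_vertical`, `support`, `frequent_pairs`) and the toy transactions are assumptions made for illustration.

```python
from itertools import combinations


def to_vertical(transactions):
    """Convert a list of transactions into a vertical layout:
    each item maps to the set of transaction ids (tidset) containing it."""
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets


def support(itemset, tidsets):
    """Support of an itemset is the size of the intersection
    of the tidsets of its items."""
    tids = None
    for item in itemset:
        item_tids = tidsets.get(item, set())
        tids = set(item_tids) if tids is None else tids & item_tids
    return 0 if tids is None else len(tids)


def frequent_pairs(tidsets, min_support):
    """Return every item pair whose intersected tidset has at least
    min_support entries. Each intersection is an intermediate result
    that a larger miner would have to keep for the next level."""
    result = {}
    for a, b in combinations(sorted(tidsets), 2):
        common = tidsets[a] & tidsets[b]
        if len(common) >= min_support:
            result[(a, b)] = common
    return result


# Hypothetical toy data, not taken from the thesis.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
vertical = to_vertical(transactions)
pairs = frequent_pairs(vertical, min_support=2)
```

Every candidate pair yields its own intersected tidset, so the number of intermediate sets grows quickly with the number of items. That growth is the source of the disk traffic that compressing the vertical records is meant to reduce.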

Keywords/Search Tags:

Science and statistics, compressed database, association rule mining, cluster mining, vertical data layout

Related items

1	An Algorithm Based On Density And Grid For Mining And Clustering Association Rules
2	The Research On Algorithm For Association Rules Mining Based On Vertical Data Presentation
3	Research On Association Rule Mining Algorithm Based On Time-stamp And Vertical Format
4	Research And Application Of Association Rule In Data Mining
5	Studies And Applications Of Association Rule Mining Methods In Data Mining
6	Association Rule Mining Research And Application Of The Algorithm
7	Research On Mining Algorithm Of Association Rule And Its Application For Biological Data
8	Some Key Problems In The KDD
9	Research On The Problem Of Association Rule Mining In Incomplete Relational Database
10	The Improvement And Research For Association Rule Mining Algorithm Based On Compressed Matrix