Research On Projected Clustering Algorithm And Its Applications

Posted on:2008-07-09

Degree:Master

Type:Thesis

Country:China

Candidate:Z H Zhang

Full Text:PDF

GTID:2178360218952863

Subject:Computer application technology

Abstract/Summary:

As cluster analysis is applied in more fields, more and more high-dimensional and mixed-type data need to be processed. However, many existing methods are effective only for low-dimensional data, for data of one specific type, or both. To address these two problems, this thesis presents a preliminary study of clustering high-dimensional, mixed-type datasets (containing binary, categorical and numerical attributes).
First, each numerical attribute in the dataset is discretized separately by the proposed Density-based Grouping Method (DGM), which replaces each actual value with an interval label. Second, the dataset is transformed into a transaction database (TDB) by numbering all valid values, removing missing values, and adding a transaction identifier for each point. The concept of the Longest Frequent Closed Itemset (LFCI) is then defined, and transactions with the same LFCI are clustered together, using two key properties of an LFCI: (1) it covers the transaction maximally, and (2) it is a description of the corresponding transaction. To meet the requirements of clustering, the traditional Frequent-Pattern Tree (FP-Tree) is adapted in three respects, and the procedure for building the adapted FP-Tree is described in detail. Based on an analysis of this construction procedure, a method for updating the FP-Tree is presented to reduce space complexity, and a strategy for pruning invalid trees, derived from the characteristics of LFCIs, is used to reduce time complexity. LFCI-Growth, an algorithm that mines the LFCIs of each transaction, is then presented. When a transaction has multiple LFCIs, one of them is selected as its description, and all selected LFCIs are inserted into a cluster tree defined for this purpose. Each cluster consists of its associated dimensions and its points: the dimensions are the items on the path from the root node to the end node to which transaction identifiers are linked, and the points are the identifiers linked to that end node of the cluster tree. Taken together, this procedure forms a framework for clustering high-dimensional, mixed-type data, and it is essentially a projected clustering method.
Extensive experiments on simulated data, carried out to validate the Clustering Algorithm based on the Longest Frequent Closed Itemsets (CA-LFCI), indicate that CA-LFCI is broadly effective and efficient. In addition, CA-LFCI is applied to two real datasets, Votes and Mushroom. The results show that for Votes, accuracy is above 95% (up to 98.62%) and the running time is under 0.71 seconds; for Mushroom, accuracy is above 97% (up to 99.8%) and the running time is under 5.5 seconds, when the input parameter, minimal support, is set below 22%. Finally, the results are usable and comprehensive.
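The grouping idea behind the framework can be illustrated with a minimal Python sketch. It is not the thesis's implementation: it assumes numerical attributes have already been replaced by interval labels (DGM itself is not reproduced), it uses a brute-force search instead of the adapted FP-Tree and LFCI-Growth, and it takes the minimal support as an absolute count. The function names and the tie-break rule for choosing one LFCI (highest support, then lexicographic order) are hypothetical choices made only for illustration. The sketch relies on the fact that a longest frequent subset of a transaction is necessarily closed, because its closure is also frequent and also contained in that transaction.

```python
from collections import defaultdict
from itertools import combinations


def to_transactions(records):
    """Map each record to a set of (attribute, value) items, dropping missing values."""
    return {
        tid: frozenset((attr, val) for attr, val in rec.items() if val is not None)
        for tid, rec in enumerate(records)
    }


def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for items in transactions.values() if itemset <= items)


def longest_frequent_itemset(items, transactions, min_count):
    """Brute-force search for one longest frequent subset of `items`.

    Exponential in the transaction length, so only suitable for toy data.
    """
    ordered = sorted(items)
    for size in range(len(ordered), 0, -1):
        candidates = [frozenset(c) for c in combinations(ordered, size)]
        frequent = [c for c in candidates if support(c, transactions) >= min_count]
        if frequent:
            # Assumed tie-break: highest support first, then lexicographic order.
            return min(frequent, key=lambda c: (-support(c, transactions), sorted(c)))
    return frozenset()  # no frequent item at all


def cluster_by_lfci(records, min_count):
    """Group transaction ids by the itemset chosen as their description."""
    transactions = to_transactions(records)
    clusters = defaultdict(list)
    for tid, items in transactions.items():
        key = longest_frequent_itemset(items, transactions, min_count)
        clusters[key].append(tid)
    return dict(clusters)


# Toy mixed-type data; "size" is assumed to be already discretized into intervals.
records = [
    {"color": "red", "size": "[0,5)", "shape": "round"},
    {"color": "red", "size": "[0,5)", "shape": "square"},
    {"color": "red", "size": "[0,5)", "shape": None},  # missing value is dropped
    {"color": "blue", "size": "[5,9)", "shape": "round"},
    {"color": "blue", "size": "[5,9)", "shape": "flat"},
]

clusters = cluster_by_lfci(records, min_count=2)
for dims, tids in clusters.items():
    print(sorted(dims), tids)
```

In this toy run each cluster key plays the role of the cluster's associated dimensions and the list of identifiers plays the role of its points, which mirrors the projected-clustering reading of the framework.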

Keywords/Search Tags:

data mining, clustering analysis, high-dimensional data, mixed-type data, longest frequent closed itemset, adapted FP-Tree, minimal support

Related items

1	Research Of An Algorithm For Frequent Closed Itemset Mining On Data Stream
2	Data-Mining Methods Study And Its Application In Traditional Chinese Prescription Compatibility Analysis
3	Study Of Fast Algorithms For Frequent Itemset Mining From Uncertain Data
4	Research And Application On Association Rule Mining
5	The Research And Application Of Association Rules Mining Algorithms Based On Directed Itemset Graph
6	Research On Frequent And Closed High Utility Itemset Mining Algorithm Based On Spark
7	The Research On The Algorithm About Online Mining Closed Frequent Itemsets Over Data Stream
8	Research On Key Algorithms For Mining Frequent Patterns In Data Streams And Their Application In Simulation System
9	Research On Mining Frequent Itemsets Over Data Stream
10	An Algorithm For Mining Frequent Itemsets From Data Streams