Research On Projected Clustering Algorithm And Its Applications

Posted on: 2008-07-09
Degree: Master
Type: Thesis
Country: China
Candidate: Z H Zhang
GTID: 2178360218952863
Subject: Computer application technology
Abstract/Summary:
As clustering analysis is applied in more fields, more and more high-dimensional and mixed-type data need to be processed. However, many existing methods are effective only for low-dimensional data and/or for data of one specific type. To address these two problems, this thesis presents a preliminary study of clustering high-dimensional, mixed-type datasets (containing binary, categorical and numerical attributes).

First, each numerical attribute in the dataset was discretized separately with the proposed Density-based Grouping Method (DGM), which replaces each actual value with an interval label. Second, the dataset was transformed into a transaction database (TDB) by numbering all valid data values, eliminating missing values, and adding a transaction identifier for each point. After the Longest Frequent Closed Itemset (LFCI) was defined, transactions with the same LFCI were clustered together using two key properties of an LFCI: (1) it covers the transaction maximally, and (2) it serves as a description of the corresponding transaction. To meet the requirements of clustering, the traditional Frequent-Pattern Tree (FP-Tree) was adapted in three respects, and the procedure for building the adapted FP-Tree was described in detail. Based on an analysis of this construction procedure, a method for updating the FP-Tree was presented to reduce space complexity, and a strategy for pruning invalid trees was derived from the characteristics of LFCIs to reduce time complexity. LFCI-Growth, an algorithm that mines the LFCIs of each transaction, was then presented. After one of the multiple LFCIs of each transaction was selected as its description, all selected LFCIs were inserted into a cluster tree defined for this purpose. Each cluster consists of its associated dimensions and points: the former are the items on the path from the root node to the end node linked with transaction identifiers, and the latter are the identifiers linked to that end node of the cluster tree. Taken together, this procedure forms a framework for clustering high-dimensional, mixed-type data, and it is essentially a projected clustering method.

To validate the performance of the Clustering Algorithm based on the Longest Frequent Closed Itemsets (CA-LFCI), extensive simulation experiments were conducted; they show that CA-LFCI is generally effective and efficient. In addition, CA-LFCI was applied to two real datasets, Votes and Mushroom. The results show that, for Votes, the accuracy is above 95%, reaching a maximum of 98.62%, with a running time under 0.71 seconds; for Mushroom, the accuracy is above 97%, reaching a maximum of 99.8%, with a running time under 5.5 seconds, when the input parameter, minimal support, is set below 22%. Finally, the resulting clusters are usable and comprehensible.
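
The grouping idea can be illustrated with a minimal Python sketch. It is not the thesis's CA-LFCI implementation: it replaces the adapted FP-Tree and LFCI-Growth with a brute-force search that is only practical on toy data, it assumes numerical attributes have already been replaced by interval labels (DGM itself is not reproduced), and it assumes missing values are encoded as `None`. The function names, the absolute threshold `min_count`, and the tie-breaking rule (highest support, then item order) are hypothetical choices made for illustration; the abstract does not state how one LFCI is selected when a transaction has several.

```python
from collections import defaultdict
from itertools import combinations


def to_transactions(records):
    """Map each record to a set of (attribute, value) items.

    Missing values (assumed to be None) are dropped; the list index
    of each transaction serves as its transaction identifier.
    """
    return [frozenset((a, v) for a, v in rec.items() if v is not None)
            for rec in records]


def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def longest_frequent_itemset(t, transactions, min_count):
    """Return a longest subset of `t` with support >= min_count, or None.

    A longest frequent subset of t is automatically closed: its closure
    is also a frequent subset of t and cannot be longer.
    Tie-breaking (assumption): highest support, then item order.
    """
    for size in range(len(t), 0, -1):
        candidates = []
        for combo in combinations(sorted(t), size):
            s = support(frozenset(combo), transactions)
            if s >= min_count:
                candidates.append((-s, combo))
        if candidates:
            return frozenset(min(candidates)[1])
    return None


def cluster(records, min_count):
    """Group transaction ids by their selected longest frequent itemset.

    Each key plays the role of the cluster's associated dimensions,
    each value the role of its points.
    """
    transactions = to_transactions(records)
    clusters = defaultdict(list)
    for tid, t in enumerate(transactions):
        key = longest_frequent_itemset(t, transactions, min_count)
        if key is not None:
            clusters[key].append(tid)
    return dict(clusters)


# Toy mixed-type data; "size" stands for an already discretized attribute.
records = [
    {"color": "red", "size": "[0,5)", "odor": "none"},
    {"color": "red", "size": "[0,5)", "odor": None},
    {"color": "red", "size": "[0,5)", "odor": "foul"},
    {"color": "blue", "size": "[5,10)", "odor": "foul"},
    {"color": "blue", "size": "[5,10)", "odor": "foul"},
]

for dims, points in cluster(records, min_count=2).items():
    print(sorted(dims), points)
```

On this toy data, records 0 to 2 are grouped under the two items `color=red` and `size=[0,5)`, while records 3 and 4 are grouped under all three of their items, so each cluster carries its own subset of dimensions, which is the sense in which the approach is a projected clustering method.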
Keywords/Search Tags: data mining, clustering analysis, high-dimensional data, mixed-type data, longest frequent closed itemset, adapted FP-Tree, minimal support