
Clustering algorithms for data and knowledge exploration

Posted on: 2004-02-07
Degree: Ph.D
Type: Dissertation
University: The University of Iowa
Candidate: Gan, Yuan
Full Text: PDF
GTID: 1468390011468552
Subject: Engineering
Abstract/Summary:
Data mining, also referred to as knowledge discovery in databases (KDD), has received considerable attention in industrial engineering because of its applicability in industry. Among the many data mining methods, clustering is attractive because it is simple to understand and easy to implement. However, existing clustering algorithms are restricted to a limited range of data mining problems. These limitations stem from four sources: feature type, number of feature sets, number of instances, and cluster shape. In this research, a new clustering approach based on similarity measures is developed.

The clustering process is partitioned into three steps: similarity definition, feature preparation, and clustering. In the similarity definition step, different types of similarity measures are considered, e.g., point-point, point-set, and set-set measures, as well as measures for categorical and summarized features. The defined similarity measures are used in the remaining two steps. The main purpose of the feature preparation step is to transform feature sets (by integrating, discretizing, etc.) and to remove irrelevant and redundant features. A clustering algorithm then determines clusters using the defined similarity measures. A new feature selection method is proposed; unlike traditional feature selection methods, it is based on both discrimination and the similarity measure. The selected feature subsets have the same discrimination power as the original feature set and the minimum value of the corresponding similarity measure. The concept of mutual bonds and the triangularization algorithm are key to the new clustering algorithm, which has low computational time complexity and a low intermediate storage requirement. Finally, the proposed clustering method addresses clusters of irregular shape, which many existing clustering algorithms cannot handle. A new algorithm for efficiently finding the minimum spanning tree is developed. The edge lengths of the tree are defined by the similarity measure, and clusters are formed by separating the minimum spanning tree.

The main contribution of this research is the development of a formal approach to clustering, in which similarity measures play a critical role. Computational experience with the proposed approach on various data sets (benchmark and industrial) has demonstrated its efficiency, validity, and reliability.
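The abstract does not give the details of the dissertation's minimum spanning tree algorithm, so the following is only a generic sketch of the idea of forming clusters by separating a minimum spanning tree. It assumes Euclidean distance as a stand-in for the similarity-based edge length, uses the textbook Prim algorithm rather than the new algorithm mentioned above, and cuts the k - 1 longest tree edges. The function names `mst_edges` and `mst_clusters` and the parameter `k` are hypothetical.

```python
import math


def mst_edges(points):
    """Build a minimum spanning tree with Prim's algorithm (O(n^2)).

    Assumption: Euclidean distance stands in for the edge length that the
    abstract says is defined by a similarity measure.
    Returns a list of (length, i, j) tuples.
    """
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [math.inf] * n  # shortest known distance from the tree to each point
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the point outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] != -1:
            edges.append((best[u], parent[u], u))
        # Update the distances of the remaining points to the grown tree.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    return edges


def mst_clusters(points, k):
    """Cut the k - 1 longest MST edges and return one cluster label per point."""
    edges = sorted(mst_edges(points))
    kept = edges[:max(len(edges) - (k - 1), 0)]

    # Union-find over the kept edges recovers the connected components.
    root = list(range(len(points)))

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]  # path halving
            x = root[x]
        return x

    for _, i, j in kept:
        root[find(i)] = find(j)

    # Relabel the component roots as 0, 1, 2, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(points))]
```

Calling `mst_clusters(points, 2)` on two well-separated groups of points assigns one label to each group, because the single long edge bridging the groups is the one removed. Since the tree follows chains of nearby points, this kind of splitting can trace elongated or irregular cluster shapes that centroid-based methods tend to miss.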
Keywords/Search Tags:Data, Clustering, Similarity measure, Approach, Feature, Proposed