Estimating the number of clusters in cluster analysis

Posted on: 2007-11-30
Degree: Ph.D.
Type: Dissertation
University: North Carolina State University
Candidate: Dasah, Julius Berry
Full Text: PDF
GTID: 1459390005990956
Subject: Statistics
Abstract/Summary:
In many applied fields of study, such as medicine, psychology, ecology, taxonomy, and finance, one has to deal with massive amounts of noisy but structured data. A question that often arises in this context is whether the observations in these data fall into some "natural" groups and, if so, how many. This dissertation proposes a new quantity, called the maximal jump function, for assessing the number of groups in a data set. The estimated maximal jump function measures the excess transformed distortion attainable by fitting an extra cluster to a data set. By distortion, we mean the average distance between each observation and its nearest cluster center. Distortion d_g, in this sense, is a measure of the error incurred by fitting g clusters to a data set. Three stopping rules based on the maximal jump function are proposed for determining the number of groups in a data set. A new procedure for clustering data sets with a common covariance structure is also introduced. The proposed methods are tested on a wide variety of real data, including DNA microarray data sets, as well as on high-dimensional simulated data possessing numerous "noisy" features/dimensions. To show the effectiveness of the proposed methods, they are also compared with some well-known clustering methods.
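The definition of distortion given in the abstract can be illustrated with a minimal Python sketch. It computes d_g, the average distance from each observation to its nearest cluster center, for g = 1, 2, ..., and then takes successive differences of a transformed distortion curve. Everything beyond the definition of distortion is an assumption made for illustration: the abstract does not specify the transformation, the clustering routine, or the stopping rules, so the function names (distortion, kmeans_centers, distortion_curve, jumps), the plain k-means routine, and the reciprocal transform used below are hypothetical stand-ins and not the dissertation's method.

```python
import numpy as np

def distortion(X, centers):
    """Average Euclidean distance from each observation to its nearest center."""
    # Pairwise distances, shape (n_observations, n_centers).
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.min(axis=1).mean()

def kmeans_centers(X, g, n_iter=50, seed=0):
    """Plain Lloyd's k-means (assumed stand-in for the clustering step)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=g, replace=False)].copy()
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(g):
            members = X[labels == k]
            if len(members):  # keep the old center if a cluster empties out
                centers[k] = members.mean(axis=0)
    return centers

def distortion_curve(X, max_g):
    """Distortion d_g for g = 1, ..., max_g."""
    return np.array([distortion(X, kmeans_centers(X, g))
                     for g in range(1, max_g + 1)])

def jumps(d, transform):
    """Successive differences of a transformed distortion curve.

    `transform` is a placeholder: the abstract does not say which
    transformation the maximal jump function uses.
    """
    return np.diff(transform(d))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Two synthetic, well-separated groups in two dimensions.
    X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                   rng.normal(8.0, 1.0, size=(50, 2))])
    d = distortion_curve(X, max_g=5)
    # Reciprocal transform chosen purely for illustration.
    print(d)
    print(jumps(d, lambda v: 1.0 / v))
```

In a sketch like this, a large difference at a given g suggests that adding that cluster removes a substantial amount of transformed distortion. How such differences are turned into an estimate of the number of groups is the subject of the dissertation's stopping rules, which are not reproduced here.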
Keywords/Search Tags: Cluster, Maximal jump function, Data