Finding multiple clustering structures in data, with applications to DNA microarrays

Posted on: 2004-03-31
Degree: Ph.D
Type: Dissertation
University: Stanford University
Candidate: Belitskaya, Ilana Yolyevna
Full Text: PDF
GTID: 1468390011975538
Subject: Statistics
Abstract/Summary:
Cluster analysis is the art of discovering classes in data. Traditionally, the goal of cluster analysis has been to uncover the unknown clustering structure by partitioning the observations into a single set of clusters, such that the observations within each cluster are more similar to one another than to those assigned to different clusters. However, as the number of variables grows, it becomes increasingly unlikely that any pair of observations is similar across all the variables simultaneously. Instead, observations tend to group better on small subsets of the variables. Moreover, different subsets of variables might induce different and potentially useful clustering structures of the observations. In this work, the standard clustering problem of finding a single clustering structure of observations is first generalized to the problem of discovering multiple clustering structures and finding the variables that induce them. Three dissimilarity measures for clustering variables are proposed, based on entropy, empirical measures, and interpoint-distance-based graphs, and their performance is compared with that of the widely used correlation-based dissimilarity. A binning procedure is developed that makes computation of the first two of these dissimilarity measures feasible. We also propose a weighted-distance two-way clustering method for discovering multiple clustering structures in the data, and give a randomization test for similarity of clustering structures. The motivating application is gene expression data.
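The abstract does not give formulas for the proposed measures, so the following is only an illustrative sketch of the general idea of a binned, entropy-based dissimilarity between two variables, set beside a correlation-based one. Everything specific here is an assumption rather than the dissertation's definition: the equal-width binning, the number of bins, the normalization 1 - I(X;Y)/H(X,Y), and the function names `equal_width_bins`, `entropy`, `entropy_dissimilarity`, and `correlation_dissimilarity`.

```python
import math
from collections import Counter


def equal_width_bins(values, n_bins):
    """Map each value to a bin index in 0..n_bins-1 (equal-width bins; an assumed scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def entropy(labels):
    """Shannon entropy (in nats) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def entropy_dissimilarity(x, y, n_bins=4):
    """Assumed measure: 1 - I(X;Y) / H(X,Y), computed on the binned variables.

    It is 0 when each binned variable determines the other, and 1 when the
    binned variables are independent in the sample.
    """
    bx = equal_width_bins(x, n_bins)
    by = equal_width_bins(y, n_bins)
    h_joint = entropy(list(zip(bx, by)))
    if h_joint == 0.0:
        return 0.0  # both variables are constant after binning
    mutual_info = entropy(bx) + entropy(by) - h_joint
    return 1.0 - mutual_info / h_joint


def correlation_dissimilarity(x, y):
    """Common correlation-based dissimilarity: 1 - |Pearson correlation|."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1.0 - abs(sxy / math.sqrt(sxx * syy))


# Toy data (made up for illustration): y depends on x, but not linearly.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v * v for v in x]

print(correlation_dissimilarity(x, y))  # Pearson correlation is zero here
print(entropy_dissimilarity(x, y))      # strictly between 0 and 1
```

On this toy pair the correlation-based dissimilarity treats the two variables as maximally dissimilar, while the binned entropy-based sketch still registers the dependence, which is one reason alternatives to correlation can be useful when grouping variables.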
Keywords/Search Tags: Clustering structures, Data, Finding, Multiple