Partition clustering of high dimensional low sample size data based on p-values

Posted on:2009-09-01

Degree:Ph.D

Type:Thesis

University:Kansas State University

Candidate:von Borries, George Freitas

GTID:2448390005452047

Subject:Statistics

Abstract/Summary:

This thesis introduces a new partitioning algorithm for clustering variables in high dimensional low sample size (HDLSS) data and high dimensional longitudinal low sample size (HDLLSS) data. HDLSS data contain a large number of variables with a small number of replications per variable; HDLLSS data are HDLSS data observed over time.

Clustering techniques play an important role in analyzing high dimensional low sample size data, which are common in microarray experiments, mass spectrometry, and pattern recognition. Most current clustering algorithms for HDLSS and HDLLSS data are adaptations of methods from traditional multivariate analysis, where the number of variables is not high and sample sizes are relatively large. These algorithms perform poorly when applied to high dimensional data, especially when sample sizes are small. In addition, available algorithms often exhibit poor clustering accuracy and stability for non-normal data. Simulations show that traditional clustering algorithms used for high dimensional data are not robust to monotone transformations.

The proposed clustering algorithm, PPCLUST, is a powerful tool for clustering HDLSS data. It uses p-values from nonparametric rank tests of homogeneous distribution as a measure of similarity between groups of variables. Inheriting the robustness of rank procedures, the new algorithm is robust to outliers and invariant to monotone transformations of the data. PPCLUSTEL is an extension of PPCLUST for clustering HDLLSS data. A nonparametric test of no simple effect of group is developed, and the p-value from this test is used as a measure of similarity between groups of variables.

PPCLUST and PPCLUSTEL are able to cluster a large number of variables in the presence of very few replications, and PPCLUSTEL requires neither a large number of time points nor equally spaced ones. PPCLUST and PPCLUSTEL do not suffer from loss of power due to distributional assumptions, general multiple comparison problems, or difficulty in controlling heteroscedastic variances. Applications to data from previous microarray studies show promising results, and simulation studies reveal that the algorithm outperforms a series of benchmark algorithms applied to HDLSS data, exhibiting high clustering accuracy and stability.
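
The invariance to monotone transformations claimed in the abstract can be illustrated with a minimal sketch. This is not PPCLUST itself: it assumes a two-group Kruskal-Wallis rank test with a chi-square approximation (no tie correction) as a stand-in for the thesis's rank tests, and the function names `midranks` and `two_group_rank_pvalue`, as well as the sample values, are hypothetical. Because the test depends on the data only through their ranks, applying a strictly increasing function such as `exp` leaves the p-value unchanged.

```python
import math


def midranks(values):
    """Return 1-based ranks of `values`, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def two_group_rank_pvalue(x, y):
    """P-value of a two-group Kruskal-Wallis test (illustrative assumption).

    Uses the chi-square approximation with 1 degree of freedom and
    no tie correction.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    r = midranks(list(x) + list(y))
    r1, r2 = sum(r[:n1]), sum(r[n1:])
    h = 12.0 / (n * (n + 1)) * (r1 ** 2 / n1 + r2 ** 2 / n2) - 3 * (n + 1)
    # Survival function of chi-square(1): P(X > h) = erfc(sqrt(h / 2)).
    return math.erfc(math.sqrt(max(h, 0.0) / 2.0))


# Hypothetical replications for two groups of variables.
group_a = [1.2, 0.7, 2.5, 1.9]
group_b = [3.1, 4.0, 2.8, 5.2]

p_raw = two_group_rank_pvalue(group_a, group_b)
# exp is strictly increasing, so the ranks and the p-value are unchanged.
p_exp = two_group_rank_pvalue([math.exp(v) for v in group_a],
                              [math.exp(v) for v in group_b])
print(p_raw, p_exp, p_raw == p_exp)
```

A p-value computed this way could serve as a similarity score between two groups of variables, with a large p-value suggesting the groups are compatible with a common distribution. How PPCLUST actually forms and merges groups is not specified in the abstract and is not reproduced here.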

Keywords/Search Tags:

High dimensional low sample size, Clustering, HDLSS data, Algorithm, HDLLSS data, Variables, PPCLUSTEL

Related items

1	Novel Cheminformatics Methods for Modeling Biomolecular Data in High Dimension Low Sample Size (HDLSS) Chemistry Space
2	Classification methods for high-dimensional sparse data
3	Research And Application Of Rough Clustering Algorithm For High Dimensional Data Sets
4	Research And Application Of The Clustering Algorithm For High Dimensional Data
5	Research On Clustering Algorithm Based On High Dimensional Data
6	Research On High Dimensional Data Clustering Algorithm Based On Deep Learning
7	Comparative Study On Classification Methods Of Two High-Dimensional And Small Sample Data
8	Improvement Research Of Clustering Algorithm Based On High-dimensional Data
9	A Study Of Clustering And Data Analysis Methods Based On One-Dimensional SOM
10	Classification Of Non-equilibrium High-Dimensional Small Sample Data Based On RF And LSSVM Models