High-dimensional data mining: Subspace clustering, outlier detection and applications to classification

Posted on:2011-06-02

Degree:Ph.D

Type:Thesis

University:University of Alberta (Canada)

Candidate:Foss, Andrew Philip Ogilvie

GTID:2448390002968438

Subject:Computer Science

Abstract/Summary:

Data mining in high dimensionality almost inevitably faces the consequences of increasing sparsity and declining differentiation between points. This is problematic because we usually exploit these differences for approaches such as clustering and outlier detection. In addition, the exponentially increasing sparsity tends to increase false negatives when clustering.

In the field of outlier detection, several novel algorithms suited to high-dimensional data are presented (T*ENT, T*ROF, FASTOUT). It is shown that these algorithms outperform the state-of-the-art outlier detection algorithms in ranking outlierness for many datasets regardless of whether they contain rare classes or not. Our research into high-dimensional outlier detection has even shown that our approach can be a powerful means of classification for heavily overlapping classes given sufficiently high dimensionality and that this phenomenon occurs solely due to the differences in variance among the classes. On some difficult datasets, this unsupervised approach yielded better separation than the very best supervised classifiers and on other data, the results are competitive with state-of-the-art supervised approaches. The elucidation of this novel approach to classification opens a new field in data mining, classification through differences in variance rather than spatial location.

As an appendix, we provide an algorithm for estimating false negative and positive rates so these can be compensated for.

In this thesis, we address the problem of solving high-dimensional problems using low-dimensional solutions. In clustering, we provide a new framework MAXCLUS for finding candidate subspaces and the clusters within them using only two-dimensional clustering. We demonstrate this through an implementation GCLUS that outperforms many state-of-the-art clustering algorithms and is particularly robust with respect to noise. It also handles overlapping clusters and provides either 'hard' or 'fuzzy' clustering results as desired. In order to handle extremely high dimensional problems, such as genome microarrays, given some sample-level diagnostic labels, we provide a simple but effective classifier GSEP which weights the features so that the most important can be fed to GCLUS. We show that this leads to small numbers of features (e.g. genes) that can distinguish the diagnostic classes and thus are candidates for research for developing therapeutic applications.
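
The abstract's claim that heavily overlapping classes can be separated through differences in variance alone, given sufficiently high dimensionality, can be illustrated with a small simulation. The sketch below is not one of the thesis algorithms (T*ENT, T*ROF, FASTOUT); it is a toy example under stated assumptions: two classes share the same mean, each is an isotropic Gaussian, and the standard deviations `sigma_a` and `sigma_b` are known. All function names and parameter values are hypothetical. Because the distance of a point from the shared mean concentrates near sigma times the square root of the dimension, a single threshold on that distance separates the classes increasingly well as the dimension grows.

```python
import math
import random


def sample_class(n, dim, sigma, rng):
    """Draw n points from a zero-mean isotropic Gaussian with std sigma."""
    return [[rng.gauss(0.0, sigma) for _ in range(dim)] for _ in range(n)]


def norm(point):
    """Euclidean distance of a point from the origin (the shared class mean)."""
    return math.sqrt(sum(x * x for x in point))


def separation_accuracy(dim, n=200, sigma_a=1.0, sigma_b=1.3, seed=0):
    """Fraction of points labelled correctly by thresholding the norm alone.

    Assumption: both sigmas are known, so the threshold can be placed
    midway between the two expected norms. Labels are used only to score
    the result, not to fit anything.
    """
    rng = random.Random(seed)
    a = [norm(p) for p in sample_class(n, dim, sigma_a, rng)]
    b = [norm(p) for p in sample_class(n, dim, sigma_b, rng)]
    # Norms concentrate near sigma * sqrt(dim), so split midway between them.
    threshold = 0.5 * (sigma_a + sigma_b) * math.sqrt(dim)
    correct = sum(r < threshold for r in a) + sum(r >= threshold for r in b)
    return correct / (2 * n)


if __name__ == "__main__":
    # Both classes are centred at the origin, so they overlap completely
    # in location; only their spread differs.
    for dim in (2, 20, 200):
        print(dim, round(separation_accuracy(dim), 3))
```

In two dimensions the threshold is only somewhat better than guessing, while in a few hundred dimensions it labels nearly every point correctly, even though no dimension on its own distinguishes the classes by location.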

Keywords/Search Tags:

Outlier detection, Data, Clustering, Mining, High-dimensional, Classification, Classes

Related items

1	The Researches On Related To Key Technologies Among Clustering Based On High-dimensional Data Space
2	A Study On Outlier Detection Algorithms For High Dimensional Data
3	Research On Outlier Data Mining In High Dimensional Space
4	Research Of Outlier Testing Methods In High-Dimensional Dataspace
5	Research On Outlier Detection Approach Of High-dimensional Sparse Data Based On Interpolation
6	The Research On A Few Key Issues In High Dimensional Data Mining
7	Analysis And Research Of Outlier Detection Algorithm For High Dimensional Data
8	Research And Implementation On Key Technology Of Data Stream Mining
9	Research On Classification Of Colleges In Our Country Based On Clustering Technology Of Data Mining
10	Research On Outlier Detection Algorithm For High Dimensional Big Data