A multiscale, geometric algorithm for non-parametric data exploration with an application to genomic data

Posted on: 2008-11-21
Degree: Ph.D.
Type: Thesis
University: State University of New York at Stony Brook
Candidate: McQuown, Joseph
Full Text: PDF
GTID: 2448390005962672
Subject: Statistics
Abstract/Summary:
This thesis presents an efficient, adaptive multiscale algorithm for analyzing measurement data composed of two categories: a regular set of measurements that can be described by a dominant geometry, and a set of "outliers", i.e., measurements that deviate from this underlying geometry. The algorithm uses a stopping-time construction to identify local regions of different sizes and shapes in which the data is concentrated around local lines (or d-planes), while excluding a local percentage of putative outliers that lie outside such regions. It can thus efficiently construct a description of the dominant geometry in terms of a curve (or d-dimensional graph). Using the local geometric properties, it then detects the outliers. Our approach does not require any assumption about the distributional properties of the noise, and it is robust against noise and outliers. Furthermore, the running time of the algorithm is linear in the size of the data, and it can handle high-dimensional data without a blow-up in computational cost. Genomic expression data is an application that can be analyzed well within this framework. This thesis explores experimental results on such data and describes some of the mathematical underpinnings of the algorithm and its properties.
Keywords/Search Tags:Algorithm, Data
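
The abstract does not spell out the stopping-time construction, so the following is only a minimal sketch of the general idea, under stated assumptions. It covers 2-D data with local lines only. Every name in it (`fit_line`, `multiscale_fit`, `tol`, `trim`, `min_pts`, `max_depth`) is hypothetical, and the stopping rule (halt when the retained points lie within `tol` of a refitted least-squares line) is an illustrative stand-in rather than the criterion used in the thesis. The sketch also sorts residuals in each region, so it does not reproduce the linear running time the abstract claims.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Piece = Tuple[float, float, float, float]  # (lo, hi, slope, intercept)


def fit_line(pts: List[Point]) -> Tuple[float, float]:
    """Ordinary least-squares line y = a*x + b through the given points."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    a = sxy / sxx if sxx > 0 else 0.0
    return a, my - a * mx


def multiscale_fit(pts: List[Point], lo: float, hi: float, tol: float = 0.05,
                   trim: float = 0.1, min_pts: int = 8, depth: int = 0,
                   max_depth: int = 12) -> Tuple[List[Piece], List[Point]]:
    """Recursively halve [lo, hi) until the data in a region is close to a line.

    Assumed stopping rule (not taken from the thesis): after setting aside the
    `trim` fraction of points with the largest residuals, stop if the remaining
    points lie within `tol` of a refitted line.
    """
    local = [p for p in pts if lo <= p[0] < hi]
    if not local:
        return [], []
    a, b = fit_line(local)
    # Rank by residual and set aside a local percentage as putative outliers.
    ranked = sorted(local, key=lambda p: abs(p[1] - (a * p[0] + b)))
    n_keep = len(local) - int(trim * len(local))
    kept, trimmed = ranked[:n_keep], ranked[n_keep:]
    a, b = fit_line(kept)  # refit without the putative outliers
    worst = max(abs(y - (a * x + b)) for x, y in kept)
    if worst <= tol or len(local) < min_pts or depth >= max_depth:
        # Only trimmed points that really are far from the local line are flagged.
        outliers = [p for p in trimmed if abs(p[1] - (a * p[0] + b)) > tol]
        return [(lo, hi, a, b)], outliers
    mid = 0.5 * (lo + hi)
    lp, lout = multiscale_fit(local, lo, mid, tol, trim, min_pts, depth + 1, max_depth)
    rp, rout = multiscale_fit(local, mid, hi, tol, trim, min_pts, depth + 1, max_depth)
    return lp + rp, lout + rout


# Synthetic example: a V-shaped curve y = |x - 0.5| with a few shifted points.
pts = []
for i in range(200):
    x = i / 200
    y = abs(x - 0.5)
    if i % 25 == 7:  # displace every 25th point to act as an outlier
        y += 1.0
    pts.append((x, y))

pieces, outliers = multiscale_fit(pts, 0.0, 1.0, tol=0.02)
for lo, hi, a, b in pieces:
    print(f"[{lo:.3f}, {hi:.3f}): y = {a:.3f} * x + {b:.3f}")
print(len(outliers), "points flagged as outliers")
```

Trimming before refitting is what keeps a few large deviations from dragging the local line, and it is the part of the sketch that corresponds most directly to "excluding local percentages of putative outliers". The dyadic split and the tolerance test are design choices made for brevity here.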