Spanning trees as tools for data analysis

Posted on:2008-06-10

Degree:Ph.D

Type:Dissertation

University:Rensselaer Polytechnic Institute

Candidate:Petrie, Adam

Full Text:PDF

GTID:1448390005451843

Subject:Statistics

Abstract/Summary:

We develop new methods to describe high-dimensional data based on the graph-theoretic concept of a spanning tree. The minimum spanning tree (MST), longest spanning tree (LST), and short Hamiltonian path (the "snake") are graphs that can be constructed on a dataset using Euclidean distances as edge lengths. These graphs require computational time that grows at worst linearly in the number of dimensions and quadratically with the number of datapoints. The node degree and edge length distributions of these trees serve as easy-to-visualize 1-dimensional summaries of multidimensional data, and statistics derived from these distributions can characterize key properties of the data.; Non-asymptotic theoretical properties of spanning trees are extremely difficult to derive analytically, so Monte Carlo simulations are necessary to examine the average edge length, node degree, and topological properties of trees with finite numbers of nodes. Building MSTs and LSTs on random data with up to a million points, we report the results of extensive simulations regarding the mean, standard deviation, and normality of node degree frequencies; the maximum node degree; the distribution of the longest edge length; and the mean, standard deviation, and normality of the total length of the tree, and compare with asymptotic results in the literature. Further, we analyze the correlation structure and run length distribution of snakes on uniform data.; We also present five new tree-based methods for the analysis of high-dimensional datasets. The maximum node degree in the LST is used for outlier detection. The standard deviation of the node degree distribution and a linear combination of the first three node degree frequencies are used to identify the implicit dimension of a dataset. The topology of the MST is used to find the center of the tree and a corresponding robust estimate of location. The topology of the MST is also used in the analytical derivation of the multisample generalization of the Friedman and Rafsky (1979) statistic for the multivariate two-sample test. Finally, run length statistics based on the sequence of segment lengths in the snake provide a test for multivariate uniformity almost regardless of the domain or shape of the data cloud.

Keywords/Search Tags:

Data, Spanning tree, Node degree

Related items

1	Research On Minimum Spanning Tree Algorithm For Degree Constraints Based On Multi - Populations
2	Research On Heuristic Algorithms For The Degree Constrained Minimum Spanning Tree Problem
3	Spanning tree algorithms for connectivity and routing in communication networks
4	Research On Clustering Algorithm Based On Minimum Spanning Tree
5	Wcj Genetic Algorithm For Degree-constrained Minimum Spanning Tree
6	WCJ Genetic Algorithm For Degree-Constrained Minimum Spanning Tree
7	Algorithm Research For The Constrained Minimum Spanning Tree Problem
8	Research On The Degree Constrained Minimum Spanning Tree Problem Based On Partheno-Genetic Algorithm
9	Research On Adaptive Evolution And Compound Crossover's Application In Genetic Algorithm
10	Fast Recovery Algorithms For Survivable Spanning Tree In Network