Principal component analysis in phylogenetic tree space

Posted on: 2017-01-23
Degree: Ph.D.
Type: Dissertation
University: The University of North Carolina at Chapel Hill
Candidate: Zhai, Haojin
Full Text: PDF
GTID: 1458390005491732
Subject: Statistics
Abstract/Summary:
Complex data objects arise in many fields of modern science, including drug discovery, psychology, the dynamics of gene expression, and anatomy. Object oriented data analysis is the statistical analysis of a population of complex data objects. The specific case of tree-structured data objects is a large and promising research area with many interesting questions and challenging problems. This dissertation focuses on principal component analysis in the tree space introduced by Billera, Holmes, and Vogtmann.

Principal component analysis is widely used to aid visualization and reduce dimension, and it is natural to extend this type of analysis to tree space. In this dissertation, we discuss three approaches to this extension. The first approach is multidimensional scaling, which focuses on better visualization of data in tree space and, in particular, on the out-of-sample embedding problem, which inserts additional points into previously constructed multidimensional scaling configurations. It is shown that a better visualization can be achieved by choosing a higher dimensional embedding space and displaying only the first two dimensions. The other two approaches rely on our novel definitions of a tree space line, and it is proven that there are only two types of such lines. The second approach is the sample-limited geodesic, which is an analog of the first type of line. This idea defines the first principal component for a set of trees by maximizing the data projection variance over geodesic segments connecting pairs of trees. Our study shows that the sample-limited geodesic is not an effective principal component object in terms of capturing data variation, due to the intrinsic geometry of the data used in this dissertation, and that it does not generalize naturally to higher-order principal component objects. The third approach is based on the principal ray set, which is a representative of the second type of line. We develop heuristic search algorithms for first order principal ray sets and for higher order principal axis sets, which are special cases of principal ray sets. Principal ray sets are better summaries for less variable data, but capture very limited information for data with larger spread.
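To make the sample-limited geodesic idea concrete, the following is a minimal sketch of the pairwise search it describes, not the dissertation's implementation. It assumes the data are points in ordinary Euclidean space, so each "geodesic segment" is simply a straight segment between two sample points; in the tree space of Billera, Holmes, and Vogtmann both the geodesic and the projection would have to be replaced by their tree space counterparts. The function names `project_onto_segment` and `sample_limited_segment` are hypothetical.

```python
import itertools

import numpy as np


def project_onto_segment(points, a, b):
    """Project each point onto the segment [a, b].

    Returns the position of each projection along the segment,
    measured as distance from a (so values lie in [0, |b - a|]).
    """
    direction = b - a
    length = np.linalg.norm(direction)
    if length == 0.0:
        # Degenerate segment: every point projects to a.
        return np.zeros(len(points))
    # Unclamped parameter t of the orthogonal projection onto the line.
    t = (points - a) @ direction / length**2
    # Clamp to [0, 1] so the projection stays on the segment itself.
    return np.clip(t, 0.0, 1.0) * length


def sample_limited_segment(points):
    """Search all pairs of sample points and return the pair whose
    connecting segment maximizes the variance of the projected data.

    Euclidean stand-in (an assumption) for the tree space search.
    """
    best_pair, best_var = None, -np.inf
    for i, j in itertools.combinations(range(len(points)), 2):
        positions = project_onto_segment(points, points[i], points[j])
        var = positions.var()
        if var > best_var:
            best_pair, best_var = (i, j), var
    return best_pair, best_var


if __name__ == "__main__":
    data = np.array(
        [[0.0, 0.0], [1.0, 0.1], [2.0, -0.1], [3.0, 0.0], [1.5, 0.5]]
    )
    pair, variance = sample_limited_segment(data)
    print("best pair:", pair, "projection variance:", variance)
```

Because the candidate segments are limited to those joining two sample points, the search costs one projection pass per pair, and the result can be no better than the best segment the sample happens to contain.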
Keywords/Search Tags:Principal, Data, Tree space