Algorithmical and geometrical aspects of statistical depth

Posted on: 2001-09-05
Degree: Ph.D.
Type: Thesis
University: Universitaire Instelling Antwerpen (Belgium)
Candidate: Struyf, Anja
Full Text: PDF
GTID: 2468390014958083
Subject: Statistics
Abstract/Summary:
The first part of this thesis focuses on cluster analysis. Cluster analysis methods try to detect whether a data set consists of several groups. Our goal was to adapt a series of standalone Fortran programs, which are widely used in several domains, so that they meet today's standards. For this purpose, we have transformed them into an object-oriented library of clustering functions in S-PLUS. This library also contains graphical displays and indices to evaluate the goodness of the clustering.

The remaining chapters of the thesis discuss statistical depth. In statistics, depth generalizes the univariate concept of ranking to other settings, such as multivariate location and regression. The location depth $\mathrm{ldepth}(\theta; X_n)$ of a point $\theta$ relative to a data set $X_n = \{x_1, \ldots, x_n\} \subset \mathbb{R}^p$ measures how centrally $\theta$ lies in the data cloud $X_n$. Points outside the convex hull of $X_n$ have depth zero, boundary points have low depth, and centrally located points have large depth values. The finite-sample definition of the location depth can easily be generalized to any probability distribution $P$ on $\mathbb{R}^p$. The regression depth $\mathrm{rdepth}(\theta; Z_n)$ measures how well a hyperplane $H_\theta$ with coefficients $\theta$ fits a data set $Z_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ in $\mathbb{R}^p$. If the data are well balanced around the hyperplane, then the hyperplane has a large regression depth, while hyperplanes that do not represent the data well receive a low depth. This depth notion may again be generalized to any probability distribution $P$ on $\mathbb{R}^p$. The location depth and the regression depth turn out to have many similar properties, theoretically as well as computationally.

First we describe an algorithm to compute the location depth of a given point $\theta$ when $p = 3$, as well as algorithms for the regression depth when $p = 3$ or $4$. We prove the exactness of these algorithms. Their complexity is $O(n^{p-1} \log n)$, which grows exponentially in $p$, making this approach impractical for more than three dimensions. Therefore we also propose approximate algorithms for higher-dimensional data sets. A point with maximal location depth relative to the data set $X_n$ can be used as a robust estimator of location. This deepest location $T_l^*$ is a natural generalization of the univariate median to higher dimensions. We construct an approximate algorithm for the deepest location in any dimension.

Moreover, we prove some characterization properties. We show that the empirical distribution of the original data points is uniquely determined by the location depth function and by the regression depth function. We also discuss the relation between the depth function and symmetry properties of the original distribution $P$, which may be continuous as well as discrete.
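
As an illustration of the location depth described above, here is a minimal Python sketch. It is not the algorithm of the thesis: it assumes the usual halfspace reading of location depth (the smallest number of data points in any closed halfspace whose boundary passes through theta) and approximates the minimum over a finite set of random directions, so the value it returns is an upper bound on the exact depth. The function names `approx_location_depth` and `approx_deepest_point`, the parameter `n_dirs`, and the restriction of the deepest-location search to the data points themselves are all assumptions made for this example.

```python
import numpy as np


def approx_location_depth(theta, X, n_dirs=2000, seed=0):
    """Approximate (upper-bound) location depth of theta relative to the rows of X.

    Hypothetical helper: the minimum over all closed halfspaces through theta
    is replaced by a minimum over n_dirs random directions.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    theta = np.asarray(theta, dtype=float)

    # Random unit vectors u; each defines the closed halfspace
    # {x : u . (x - theta) >= 0} with theta on its boundary.
    U = rng.standard_normal((n_dirs, X.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)

    # proj[i, k] is the signed position of point i along direction k.
    proj = (X - theta) @ U.T
    counts = (proj >= 0).sum(axis=0)  # number of points in each halfspace
    return int(counts.min())


def approx_deepest_point(X, n_dirs=2000, seed=0):
    """Index of the data point with the largest approximate depth.

    Simplification for illustration: only the data points are tried as
    candidates, whereas a deepest location need not be a data point.
    """
    depths = [approx_location_depth(x, X, n_dirs, seed) for x in np.asarray(X)]
    return int(np.argmax(depths))


# Four corners of a square plus its centre.
X = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0], [0.0, 0.0]])

print(approx_location_depth([0.0, 0.0], X))  # centre: deep point
print(approx_location_depth([1.0, 1.0], X))  # corner: boundary point
print(approx_location_depth([5.0, 5.0], X))  # outside the convex hull
print(approx_deepest_point(X))
```

In this toy data set the centre receives the largest value, a corner receives a low value, and a point outside the convex hull receives zero, which matches the qualitative behaviour stated in the abstract. Increasing `n_dirs` tightens the upper bound at a cost that is linear in the number of directions.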
Keywords/Search Tags:Depth, Data, Distribution