Font Size: a A A

Low-storage sequential methods for data mining and the analysis of massive datasets

Posted on:2004-03-23Degree:Ph.DType:Thesis
University:The Pennsylvania State UniversityCandidate:McDermott, James PatrickFull Text:PDF
GTID:2468390011963544Subject:Statistics
Abstract/Summary:
In this thesis we study low-storage, sequential estimators and methods for the analysis of massive datasets. A low-storage, sequential estimator is one that uses very low storage relative to the size of the entire dataset and can be updated sequentially as each new data point is observed. An example of an estimator of this type is the sample mean. We may update the sample mean sequentially by keeping track of the total number of points observed, n', and the sum of these points, i=1n' xi. Hence it is a low-storage, sequential estimator because it can be updated after each new point is observed by updating only these two quantities, n' and i=1n' xi. While moment-based examples such as this are helpful for illustrating the nature of the estimators we will be studying, it will in general be more useful to have estimators that are more robust to outliers, such as the sample quantiles. However, utilizing the sample quantiles introduces the additional burdens of increased storage and computation time. Hence we would like an estimator with properties similar to the sample quantiles, but which requires much less storage and computation time. We propose methods for the sequential computation of a single quantile and the simultaneous sequential estimation of an arbitrary set of multiple quantiles. Both methods are shown to estimate the population quantiles with accuracy and variability that is comparable to the sample quantiles. We utilize the quantile output from the extended quantile estimation algorithm for the investigation of various curve fitting techniques, such as smoothing and interpolating cubic splines, in an attempt to obtain an estimate of the entire empirical cumulative distribution function in a functional form that we may work with analytically. Upon obtaining this functional form; we then address the problem of sequential density estimation. Two methods are investigated: sequential kernel density estimation and density estimates through derivatives of the cubic spline fit. Both methods are shown to accurately estimate the unknown underlying density with the cubic spline derivative method having several advantages over the sequential kernel method such as much less computation time and the ability to obtain estimates over the entire range of the dataset. In dimensions of two or greater, quantiles have no direct analog. Instead, we study sequential versions of convex hull peeling algorithms as a way of finding and sequentially tracking the center and depth contours of a data cloud in two or more dimensions. We demonstrate the accuracy and reduced computation time required of the proposed methods by comparing to the existing convex hull peeling methods through simulation studies. Using the contours obtained as output from these algorithms, we use them to estimate the values of the underlying bivariate density for the points in the contours. We then fit a multi-dimensional spline to this three dimensional set of points to obtain an estimate of the entire bivariate density in a functional form.
Keywords/Search Tags:Sequential, Methods, Low-storage, Data, Functional form, Density, Estimate, Computation time
Related items