Low-storage sequential methods for data mining and the analysis of massive datasets

Posted on:2004-03-23

Degree:Ph.D

Type:Thesis

University:The Pennsylvania State University

Candidate:McDermott, James Patrick

Full Text:PDF

GTID:2468390011963544

Subject:Statistics

Abstract/Summary:

In this thesis we study low-storage, sequential estimators and methods for the analysis of massive datasets. A low-storage, sequential estimator is one that uses very low storage relative to the size of the entire dataset and can be updated sequentially as each new data point is observed. An example of an estimator of this type is the sample mean. We may update the sample mean sequentially by keeping track of the total number of points observed, n', and the sum of these points, i=1n' xi. Hence it is a low-storage, sequential estimator because it can be updated after each new point is observed by updating only these two quantities, n' and i=1n' xi. While moment-based examples such as this are helpful for illustrating the nature of the estimators we will be studying, it will in general be more useful to have estimators that are more robust to outliers, such as the sample quantiles. However, utilizing the sample quantiles introduces the additional burdens of increased storage and computation time. Hence we would like an estimator with properties similar to the sample quantiles, but which requires much less storage and computation time. We propose methods for the sequential computation of a single quantile and the simultaneous sequential estimation of an arbitrary set of multiple quantiles. Both methods are shown to estimate the population quantiles with accuracy and variability that is comparable to the sample quantiles. We utilize the quantile output from the extended quantile estimation algorithm for the investigation of various curve fitting techniques, such as smoothing and interpolating cubic splines, in an attempt to obtain an estimate of the entire empirical cumulative distribution function in a functional form that we may work with analytically. Upon obtaining this functional form; we then address the problem of sequential density estimation. Two methods are investigated: sequential kernel density estimation and density estimates through derivatives of the cubic spline fit. Both methods are shown to accurately estimate the unknown underlying density with the cubic spline derivative method having several advantages over the sequential kernel method such as much less computation time and the ability to obtain estimates over the entire range of the dataset. In dimensions of two or greater, quantiles have no direct analog. Instead, we study sequential versions of convex hull peeling algorithms as a way of finding and sequentially tracking the center and depth contours of a data cloud in two or more dimensions. We demonstrate the accuracy and reduced computation time required of the proposed methods by comparing to the existing convex hull peeling methods through simulation studies. Using the contours obtained as output from these algorithms, we use them to estimate the values of the underlying bivariate density for the points in the contours. We then fit a multi-dimensional spline to this three dimensional set of points to obtain an estimate of the entire bivariate density in a functional form.

Keywords/Search Tags:

Sequential, Methods, Low-storage, Data, Functional form, Density, Estimate, Computation time

Related items

1	The Study On Functional Verification Methods And Technologies For Sequential Circuits
2	Kernel Methods in Nonparametric Functional Time Serie
3	Real time optimization of chemical processes: A comparison of closed versus open form equations using commercial simulation packages
4	Study On The Structure And Properties Of Silicon Clusters Based On Density Functional Theory
5	Form Of Authentication Methods For Sequential Logic Circuits Research
6	Distinguishing forms of statistical density dependence and independence in animal time series data using information criteria
7	Research On The Methods Of Crowd Density Estimation And Motion State Analysis Based On Video Sequence
8	Research On The Classification Of Non Time-sequential Image And Time-sequential Speech
9	Research Of Probability Hypothesis Density Algorithm Based On Sequential Monte Carlo Method
10	Secure and verifiable outsourcing of computation and storage