Supervised and unsupervised statistical learning with massive heterogeneous data

Posted on: 2004-03-01
Degree: Ph.D
Type: Dissertation
University: Arizona State University
Candidate: Tuv, Eugene
Full Text: PDF
GTID: 1468390011971974
Subject: Statistics
Abstract/Summary:
The research presented herein is motivated by the ongoing development of advanced data analysis tools at Intel Corporation. The core requirement for these tools is the ability to provide interpretable, visual analysis of massive heterogeneous (mixed-type) datasets, which are often dirty and may contain large blocks of non-randomly missing data.

To enable the use of the latest advances in nonparametric additive predictive modelling with categorical predictors of large cardinality, we discuss an efficient, computationally fast preprocessing mechanism. For each such variable, it discovers a small number of natural partitions of its values that have similar statistical properties with respect to a categorical response.

Traditional approaches to clustering mixed-type datasets use a “weak” distance whose nominal portion corresponds to simplistic matching and fails to make use of the additional information provided by the levels of nominal variables. We consider supervised contrasting independency (SCI) clustering of mixed-type data of practically any complexity. SCI finds a natural partition of the data into “interesting” clusters (nuggets) and is capable of handling missing values.

In unsupervised settings, Self Organizing Maps (SOM) have proven to be extremely useful for visual data exploration. As a special kind of neural network, SOM works with metric distances and therefore deals only with numeric variables. For a multivariate dataset with variables of mixed type, we propose an efficient procedure that enriches the original dataset by assigning numeric scores to the levels of nominal variables; the scores attempt to preserve the mutual information among all variables. Even though it was mainly motivated by the need for low-dimensional exploratory visualization of complex heterogeneous data, the proposed, relatively simple preprocessing scheme can be useful in any distance-based learning. Two examples demonstrate this approach in instance-based supervised applications.

A comprehensive approach to fault detection and statistical process control is proposed. The method is based on the practical observation that the joint distribution of the monitored variables is typically unknown and rarely multivariate normal. Furthermore, for modern semiconductor manufacturing, an approach should handle variables of different data types. Only data describing the in-control (fault-free) state of the process is available, and a verifiable method is needed to assign a future observation to the “in-control” or “out-of-control” state. Because successful fault detection leads immediately to fault diagnosis, a method to evaluate the contributors to a fault signal is also proposed.

Several approaches are considered for imputation of missing values in sparse datasets with large blocks of non-randomly missing values: Classification and Regression Trees (CART)-based imputation, Tree-Structured SOM (TS-SOM), and SCI clustering neighborhood Nearest Neighbor methods.
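
The idea of giving the levels of a nominal variable numeric scores, so that distance-based methods such as SOM can use them, can be illustrated with a minimal sketch. The abstract does not spell out the dissertation's scoring procedure, so the code below swaps in plain mean encoding (each level is scored by the mean of a companion numeric variable) purely as a stand-in. The function name `score_levels` and the toy `tools`/`yields` data are hypothetical.

```python
from collections import defaultdict


def score_levels(levels, values):
    """Map each nominal level to the mean of a companion numeric variable.

    Assumption: plain mean encoding, used only as a stand-in; the
    dissertation's own scoring procedure is not given in the abstract.
    Rows where either entry is None (missing) are skipped.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for level, value in zip(levels, values):
        if level is None or value is None:
            continue  # ignore rows with a missing level or value
        totals[level] += value
        counts[level] += 1
    return {level: totals[level] / counts[level] for level in counts}


# Hypothetical toy data: a nominal "tool" column and a numeric "yield" column.
tools = ["A", "B", "A", "C", None, "B"]
yields = [0.90, 0.70, 0.80, None, 0.95, 0.60]
scores = score_levels(tools, yields)
```

A level whose every row is missing the numeric value (here `"C"`) receives no score in this sketch; a real pipeline would need an explicit rule for such levels.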
Keywords/Search Tags:Data, Missing values, SOM, SCI, Heterogeneous, Supervised, Mixed type, Statistical