Studies On Missing Data Imputation

Posted on: 2008-07-06  Degree: Master  Type: Thesis
Country: China  Candidate: X F Zhu  Full Text: PDF
GTID: 2178360215483331  Subject: Computer software and theory
Abstract/Summary:
Missing data is an inevitable issue in real-world applications. Data may be missing because, for example, a value could not be measured or was lost during data processing. Missing data can introduce bias and degrade both the quality of learned patterns and the performance of algorithms. Because missing data imputation is extremely difficult, the input to data mining algorithms is usually assumed to contain no missing or incorrect values. This leaves a large gap between the available data and the machinery available to process it.
Methods for dealing with missing data fall into the following categories: a) case deletion; b) learning without handling the missing data; and c) missing data imputation. Han, J., et al. observe that imputation is a popular strategy: in comparison with other methods, it uses as much information as possible from the observed data to predict missing values.
Commonly used imputation methods include parametric and non-parametric imputation. Parametric imputation is superior if a dataset can be adequately modeled parametrically, or if users can correctly specify the parametric form for the dataset. If the model is misspecified (in real applications it is usually impossible to know the true distribution of the dataset), the estimates of a parametric method may be highly biased, and optimal control factor settings may be miscalculated. Non-parametric imputation can provide superior fits by capturing structure in the dataset.
From the above analysis, missing data imputation is a practical yet challenging problem in machine learning and data mining. This thesis focuses on the following three imputation issues.
Firstly, for imputing missing data in objective attributes, we propose a non-parametric, mixture-kernel-based iterative imputation method that addresses the case in which the conditional attributes include both continuous and discrete attributes.
Secondly, for imputing missing data in conditional attributes, we design an optimal imputation ordering strategy that ranks all missing values by combining an economic criterion with the information useful for constructing the imputation model, and we propose a cost-sensitive incremental algorithm for missing data imputation. Last but not least, we propose GBKII (Grey-Based KNN Iteration Imputation), an instance-based method (a non-parametric method, in statistical terms) for handling missing data in both conditional attributes and objective attributes. Finally, we evaluate the performance of our methods on several UCI datasets and real datasets. The experimental results show that our approaches outperform the existing methods.
Compared with existing imputation methods, our approaches have the following advantages: (1) Mixture kernels can potentially give a much larger hypothesis space, with better extrapolation and interpolation abilities, than imputation models that use either a single kernel or a composite kernel to construct non-parametric kernel estimators; in addition, replacing the Minkowski distance with grey relational analysis can improve system performance. These two methods are novel in both technique and theory for imputing missing data. (2) All three methods, which make the best use of all observed information, including the instances with missing data, are non-parametric EM-like iterative imputation methods and can overcome the disadvantages of single imputation and multiple imputation methods. In particular, these algorithms can converge faster than the existing EM algorithm, in which both the E and M steps depend on parametric models. This idea extends the theory of parametric iterative imputation methods by taking non-parametric techniques into account, and it is also an innovation because non-parametric iterative imputation methods are proposed here for the first time.
(3) The imputation ordering strategy considers the economic criterion in order to minimize the imputation cost, and also takes the useful information into account so as to make the best use of the observed information and improve the performance of the imputation model.
The rest of this dissertation is organized as follows. Chapter 1 briefly surveys the development of data mining. Chapter 2 gives a brief overview of missing data and then discusses its sources, its consequences, and the imputation methods for it. Chapters 3, 4, and 5 design three algorithms for imputing missing data in conditional attributes, in objective attributes, and in both conditional and objective attributes, respectively. Chapter 6 concludes the dissertation and outlines future work.
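The idea of a grey-based k-NN iterative imputation can be illustrated with a minimal Python sketch. This is not the thesis's GBKII algorithm, whose details the abstract does not give. It assumes Deng's grey relational grade with distinguishing coefficient rho = 0.5 as the similarity measure, column-mean initialization, and a neighbor-mean update that is repeated until the imputed values stop changing. The names `grey_grades` and `grey_knn_impute` and all parameters are hypothetical. The sketch also assumes numeric attributes on comparable scales, at least two rows and two columns, and at least one observed value per column.

```python
from typing import List, Optional

Row = List[Optional[float]]


def grey_grades(ref: List[float], others: List[List[float]],
                rho: float = 0.5) -> List[float]:
    """Grey relational grade of each row in `others` with respect to `ref`.

    A higher grade means the row is more similar to the reference row.
    """
    diffs = [[abs(a - b) for a, b in zip(ref, row)] for row in others]
    flat = [d for row in diffs for d in row]
    d_min, d_max = min(flat), max(flat)
    if d_max == 0.0:  # every row equals the reference
        return [1.0] * len(others)
    return [
        sum((d_min + rho * d_max) / (d + rho * d_max) for d in row) / len(row)
        for row in diffs
    ]


def grey_knn_impute(data: List[Row], k: int = 2, max_iter: int = 20,
                    tol: float = 1e-6) -> List[List[float]]:
    """Iteratively impute None cells using the k most grey-similar rows."""
    n, m = len(data), len(data[0])
    missing = [(i, j) for i in range(n) for j in range(m)
               if data[i][j] is None]

    # Initial guess (assumption): mean of the observed values in each column.
    filled = [list(row) for row in data]
    for j in range(m):
        observed = [data[i][j] for i in range(n) if data[i][j] is not None]
        mean_j = sum(observed) / len(observed)
        for i in range(n):
            if filled[i][j] is None:
                filled[i][j] = mean_j

    for _ in range(max_iter):
        max_change = 0.0
        for i, j in missing:
            # Compare rows on every attribute except the one being imputed.
            cols = [c for c in range(m) if c != j]
            ref = [filled[i][c] for c in cols]
            cand = [r for r in range(n) if r != i]
            grades = grey_grades(ref, [[filled[r][c] for c in cols]
                                       for r in cand])
            # Keep the k candidates with the highest grade.
            ranked = sorted(zip(grades, cand), reverse=True)[:k]
            new_val = sum(filled[r][j] for _, r in ranked) / len(ranked)
            max_change = max(max_change, abs(new_val - filled[i][j]))
            filled[i][j] = new_val
        if max_change < tol:  # imputed values have stabilized
            break
    return filled
```

For the toy table `[[1.0, 2.0], [2.0, 4.0], [3.0, None], [4.0, 8.0]]` with `k=2`, the missing cell is filled with 6.0, the mean of the second attribute over its two most similar rows. Candidate rows that themselves contain imputed values are kept as neighbors, which loosely mirrors the abstract's point about using instances with missing data.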
Keywords/Search Tags: Missing Data Imputation, Incremental Imputation, Iteration Imputation, Cost-sensitive, Mixture Kernels, Kernel Function, k-NN algorithm