An Expectation Maximization Application For Decision Tree Classifiers On Datasets With Missing Values

Posted on: 2011-07-20
Degree: Master
Type: Thesis
Country: China
Candidate: Emmanuel Kayitaba A M N
Full Text: PDF
GTID: 2198330335991383
Subject: Computer application technology
Abstract/Summary:
Missing values are very common in the datasets used in real-world data mining and machine learning applications, and they ought to be addressed effectively. They arise for several reasons, such as measurement failure, denial of access, data loss, and data loading issues. Several traditional approaches have been used to deal with the practical problem of missing data, with unsatisfactory results. The first and most common traditional approach is simply to omit the cases with missing data and run the analysis on the remaining data. This approach is usually called listwise deletion, but it is also known as complete case analysis. Listwise deletion often results in a substantial decrease in the sample size available for the analysis. Under the assumption that data are missing completely at random, however, it does lead to unbiased parameter estimates.
Another alternative to listwise deletion (but still a poor approach) is pairwise deletion. Many data mining packages offer the option of pairwise deletion, also known as "unwise" deletion. With this feature, each element of the intercorrelation matrix is computed from all the data available for that pair of variables. The shortcoming of pairwise deletion is that the parameter estimates are based on different sets of data, with different sample sizes and different standard errors. It is therefore quite possible to generate an intercorrelation matrix that is not positive definite, which is likely to compromise and halt the analysis.
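The two deletion strategies described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the thesis's code: the function names `listwise_delete` and `pairwise_corr` and the toy data are assumptions made for the example. The toy data are deliberately constructed so that each pair of variables is observed on a different set of rows, which is the situation in which a pairwise correlation matrix can fail to be positive definite.

```python
import numpy as np

def listwise_delete(X):
    """Complete case analysis: keep only the rows with no missing value."""
    return X[~np.isnan(X).any(axis=1)]

def pairwise_corr(X):
    """Correlation matrix where each entry uses every row in which
    both variables of that pair are observed."""
    p = X.shape[1]
    R = np.eye(p)
    n_used = np.zeros((p, p), dtype=int)   # sample size behind each entry
    for i in range(p):
        for j in range(p):
            both = ~np.isnan(X[:, i]) & ~np.isnan(X[:, j])
            n_used[i, j] = both.sum()
            if i != j:
                R[i, j] = np.corrcoef(X[both, i], X[both, j])[0, 1]
    return R, n_used

nan = np.nan
# Hypothetical toy data: every row is missing exactly one of three variables.
X = np.array([
    [1, 1, nan], [2, 2, nan], [3, 3, nan], [4, 5, nan],   # a, b observed
    [1, nan, 1], [2, nan, 2], [3, nan, 3], [4, nan, 5],   # a, c observed
    [nan, 1, 5], [nan, 2, 3], [nan, 3, 2], [nan, 4, 1],   # b, c observed
], dtype=float)

complete = listwise_delete(X)
R, n_used = pairwise_corr(X)
min_eig = np.linalg.eigvalsh(R).min()

print("rows kept by listwise deletion:", complete.shape[0], "of", X.shape[0])
print("rows used per pair:\n", n_used)
print("smallest eigenvalue of pairwise matrix:", min_eig)
```

In this constructed case listwise deletion keeps no rows at all, while the pairwise matrix has a negative smallest eigenvalue, so it is not positive definite and could not be the correlation matrix of any complete dataset.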
Pairwise deletion has commonly been suggested when there are only a few missing observations. When many observations are missing, both listwise and pairwise deletion can do a lot of damage to the data analysis.
A better approach to dealing with missing data is imputation, which estimates the missing values from the existing data: each missing value is replaced with an estimate based on the information available in the dataset. Imputation methods fall mainly into single and multiple imputation methods. In single imputation, each missing value is replaced with only one imputed value, while in multiple imputation, each missing value is replaced with a set of n plausible values.
This thesis discusses the approaches to solving the issues raised by missing data and proposes, as the best alternative, a Bayesian network imputation algorithm based on the expectation maximization approach, which imputes the missing values and thereby enhances the classification accuracy of a decision tree classifier. After implementing our expectation maximization Bayesian network imputation algorithm and analyzing its performance, we obtain satisfactory results that meet our research objectives. Following our experimental methodology, we compare the performance of our expectation maximization Bayesian network algorithm against that of three other canonical algorithms (the Mean, naive Bayes and k-NN imputation algorithms) on several University of California Irvine datasets with different rates of missing data. We observe and conclude from the results that our algorithm has the best imputation performance and achieves the best classification accuracy.
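To make the idea of expectation maximization imputation concrete, the sketch below contrasts mean imputation with a simple EM imputer. It is not the thesis's Bayesian network algorithm: as a stand-in it assumes the data follow a multivariate Gaussian, and the names `mean_impute` and `em_gaussian_impute`, the synthetic data and the iteration count are all assumptions made for this illustration.

```python
import numpy as np

def mean_impute(X):
    """Single-imputation baseline: replace each NaN with its column mean."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

def em_gaussian_impute(X, n_iter=50):
    """EM imputation under a multivariate Gaussian model (illustrative stand-in).

    Assumes every row has at least one observed value.
    """
    n, p = X.shape
    missing = np.isnan(X)
    filled = mean_impute(X)                      # starting point
    mu = filled.mean(axis=0)
    sigma = np.cov(filled, rowvar=False, bias=True) + 1e-6 * np.eye(p)
    for _ in range(n_iter):
        extra = np.zeros((p, p))                 # summed conditional covariances
        # E-step: expected value of each missing block given the observed block.
        for i in range(n):
            m = missing[i]
            if not m.any():
                continue
            o = ~m
            coef = sigma[np.ix_(m, o)] @ np.linalg.inv(sigma[np.ix_(o, o)])
            filled[i, m] = mu[m] + coef @ (X[i, o] - mu[o])
            extra[np.ix_(m, m)] += sigma[np.ix_(m, m)] - coef @ sigma[np.ix_(o, m)]
        # M-step: re-estimate the mean and covariance from the completed data.
        mu = filled.mean(axis=0)
        centered = filled - mu
        sigma = (centered.T @ centered + extra) / n + 1e-6 * np.eye(p)
    return filled

# Synthetic data (hypothetical): the second column depends strongly on the first.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2.0 * x1 + 0.1 * rng.normal(size=200)
truth = np.column_stack([x1, x2])
X = truth.copy()
X[:50, 1] = np.nan                               # hide 25% of the second column

def rmse(imputed):
    """Root mean squared error on the entries that were hidden."""
    return float(np.sqrt(np.mean((imputed[:50, 1] - truth[:50, 1]) ** 2)))

mean_rmse = rmse(mean_impute(X))
em_rmse = rmse(em_gaussian_impute(X))
print("mean imputation RMSE:", mean_rmse)
print("EM imputation RMSE:  ", em_rmse)
```

Because the EM imputer uses the relationship between the columns, it should recover the hidden values far more closely than the column mean does on data like these. The thesis applies the same principle with a Bayesian network model in place of the Gaussian assumed here.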
This holds when comparing our algorithm's performance against each of the other three algorithms one by one, and also when comparing all of them together. On the four original datasets (Audiology, Breast Cancer Wisconsin, Hepatitis and Mushroom) with their original (natural) data loss, our algorithm performs better than the other three algorithms, with the highest accuracy rate of 96.14% on the Breast Cancer Wisconsin dataset.
After introducing different missing rates in 8 datasets and testing the accuracy of the four algorithms, our algorithm has the best overall performance across the missing rates and datasets compared to the other 3 algorithms. Of the 96 observations (a three-dimensional 8 × 3 × 4 observation matrix: 8 datasets, 3 missing rates and 4 algorithms), our algorithm has the best performance in 92 cases. Our algorithm has the best overall performance at the missing rate of 30% compared to the other 3 algorithms: at this missing rate it outperforms the Mean algorithm with 8.52% better accuracy on the Lymph dataset and outperforms the k-NN algorithm with 11.32% better accuracy on the Heart dataset,...
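One way to set up an experiment grid of this shape is sketched below. The abstract does not say how missing values were introduced, so removing entries uniformly at random is an assumption here. The function name `inject_missing` and the dataset names are placeholders, and the 10% and 20% rates are hypothetical (only the 30% rate is stated in the abstract).

```python
import itertools
import numpy as np

def inject_missing(X, rate, rng):
    """Return a float copy of X with a fraction `rate` of its entries set to NaN,
    chosen uniformly at random (an assumed missingness mechanism)."""
    X = np.asarray(X, dtype=float).copy()
    n_missing = int(round(rate * X.size))
    flat_idx = rng.choice(X.size, size=n_missing, replace=False)
    X.flat[flat_idx] = np.nan
    return X

# Placeholder names: the abstract does not list all 8 datasets.
datasets = [f"dataset_{k}" for k in range(1, 9)]
# Only 0.30 appears in the abstract; 0.10 and 0.20 are hypothetical.
missing_rates = [0.10, 0.20, 0.30]
algorithms = ["Mean", "naive Bayes", "k-NN", "EM Bayesian network"]

# One cell per (dataset, missing rate, algorithm) combination.
grid = list(itertools.product(datasets, missing_rates, algorithms))
print("number of observations:", len(grid))

rng = np.random.default_rng(0)
toy = np.arange(100, dtype=float).reshape(10, 10)   # stand-in for a real dataset
masked = inject_missing(toy, 0.30, rng)
realized_rate = float(np.isnan(masked).mean())
print("realized missing rate:", realized_rate)
```

In a full experiment, each cell of `grid` would hold the decision tree accuracy obtained after imputing the masked dataset with the corresponding algorithm.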
Keywords/Search Tags: imputation, decision tree classifiers, expectation maximization, Bayesian network, missing data