Research On Strategy Of Imputing Missing Data Based On Random Forest

Posted on:2017-08-18

Degree:Master

Type:Thesis

Country:China

Candidate:H J Chen

Full Text:PDF

GTID:2348330488981541

Subject:Software engineering

Abstract/Summary:

Data is the carrier of information, so data quality directly affects how well information is expressed. Within a data quality evaluation system, data integrity is one of the most important elements. In practice, however, data often becomes incomplete during collection and storage for various reasons, which affects follow-up research. Existing data mining methods mainly assume complete data when discovering knowledge, so as the missing rate rises, the accuracy of the mined knowledge suffers considerably. To eliminate this negative impact, we preprocess the data by constructing predictive models based on random forest that impute the missing values. Random forest is an effective method for estimating missing values: it has excellent classification performance and automatically measures the proximity between instances. However, the similarity measure between instances that random forest currently uses in missing data imputation is not comprehensive or accurate enough. First, to address the problem that random forest calculates the proximity between instances with a "rough binary" measuring standard, in this paper we present an improved similarity metric that takes into account the influence of the distance between decision tree nodes on proximity. In addition, to guarantee the robustness of the proposed algorithm and make the most of the original data, we improve the imputing strategy of random forest by using the information in incomplete data and introducing the idea of the kNN algorithm.
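To make the "rough binary" proximity concrete, the following is a minimal Python sketch of the baseline measure only: two instances score 1 for a tree if they land in the same leaf and 0 otherwise, averaged over all trees. It does not reproduce the improved node-distance metric, whose formula the abstract does not give. The function names `binary_proximity` and `knn_weighted_impute`, and the choice of a proximity-weighted mean over the k most proximate instances, are illustrative assumptions rather than the method of the thesis. The leaf indices are assumed to be supplied per instance and per tree (for example, as returned by scikit-learn's `RandomForestClassifier.apply`).

```python
from typing import List, Optional, Sequence


def binary_proximity(leaves: Sequence[Sequence[int]]) -> List[List[float]]:
    """Baseline proximity: fraction of trees in which two instances share a leaf.

    leaves[i][t] is the (assumed) leaf index of instance i in tree t.
    """
    n = len(leaves)
    n_trees = len(leaves[0])
    prox = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Each tree contributes 1 (same leaf) or 0 (different leaf).
            same = sum(1 for a, b in zip(leaves[i], leaves[j]) if a == b)
            prox[i][j] = same / n_trees
    return prox


def knn_weighted_impute(
    values: Sequence[Optional[float]], prox_row: Sequence[float], k: int
) -> float:
    """Hypothetical kNN-style estimate for one missing numeric value.

    Uses the k instances with an observed value that are most proximate
    to the target instance, weighted by their proximity.
    """
    candidates = [(p, v) for p, v in zip(prox_row, values) if v is not None]
    candidates.sort(key=lambda pv: pv[0], reverse=True)
    top = candidates[:k]
    total = sum(p for p, _ in top)
    if total == 0:
        # No neighbour shares any leaf: fall back to an unweighted mean.
        return sum(v for _, v in top) / len(top)
    return sum(p * v for p, v in top) / total


# Four instances, three trees; instance 0 has a missing value.
leaves = [[1, 4, 2], [1, 4, 3], [2, 4, 3], [5, 6, 7]]
prox = binary_proximity(leaves)
estimate = knn_weighted_impute([None, 10.0, 20.0, 100.0], prox[0], k=2)
```

In this toy input, instance 3 shares no leaf with instance 0, so it has zero proximity and is not among the two neighbours used for the estimate.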
Experiments on five datasets show that combining the improved proximity metric with the improved imputing strategy estimates missing values more accurately than the original approach, which verifies the effectiveness of the proposed methods.

Second, using median/mode imputation to initialize random forest imputation changes the original data distribution to some extent, which affects both the accuracy of the final imputation and the convergence rate. Therefore, in this paper we bring prior knowledge of the dataset into the random forest imputation mechanism: naive Bayes, instead of median/mode imputation, provides the initial guess that forms the temporary complete dataset. Experiments on four UCI datasets show that the proposed method achieves better missing value prediction and a better convergence rate.
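The difference between a mode-based and a naive Bayes initial guess can be illustrated with a small Python sketch for one categorical attribute. The abstract does not specify how naive Bayes is applied, so everything here is an assumption for illustration: the missing attribute is treated as the class to predict from the row's other observed attributes, Laplace smoothing with `alpha` is used, and the names `mode_guess` and `naive_bayes_guess` are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Optional, Sequence

Row = Sequence[Optional[Hashable]]


def mode_guess(rows: Sequence[Row], target: int) -> Hashable:
    """Initial guess used by plain random forest imputation: the column mode."""
    observed = [r[target] for r in rows if r[target] is not None]
    return Counter(observed).most_common(1)[0][0]


def naive_bayes_guess(
    rows: Sequence[Row], target: int, query: Row, alpha: float = 1.0
) -> Hashable:
    """Most probable value of column `target` for `query` under naive Bayes.

    Assumption: the missing attribute plays the role of the class, and the
    other observed attributes of `query` are conditionally independent given it.
    """
    train = [r for r in rows if r[target] is not None]
    classes = sorted({r[target] for r in train}, key=str)
    best_value, best_score = None, float("-inf")
    for c in classes:
        members = [r for r in train if r[target] == c]
        score = math.log(len(members) / len(train))  # log prior
        for j, x in enumerate(query):
            if j == target or x is None:
                continue  # skip the target itself and unobserved attributes
            observed = [r[j] for r in members if r[j] is not None]
            n_values = max(1, len({r[j] for r in train if r[j] is not None}))
            count = sum(1 for v in observed if v == x)
            # Laplace-smoothed log likelihood of attribute j given class c.
            score += math.log((count + alpha) / (len(observed) + alpha * n_values))
        if score > best_score:
            best_value, best_score = c, score
    return best_value


# Toy data: columns are (outlook, temperature, play); last row lacks outlook.
rows = [
    ("sunny", "hot", "no"),
    ("sunny", "hot", "no"),
    ("sunny", "mild", "no"),
    ("rainy", "cool", "yes"),
    ("rainy", "cool", "yes"),
    (None, "cool", "yes"),
]
by_mode = mode_guess(rows, target=0)
by_bayes = naive_bayes_guess(rows, target=0, query=rows[5])
```

On this toy data the mode ignores the incomplete row's own attributes and picks the majority value, while the naive Bayes guess uses them and picks the value that co-occurs with "cool" and "yes". That is the kind of prior knowledge the initialization step is meant to exploit.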

Keywords/Search Tags:

missing data imputation, random forest, proximity matrix, Bayes theory

Related items

1	Comparative Study On Imputation Methods Of Missing Data In XGBOOST Model Under Complete Random Missing Mechanism
2	Nonparametric Imputation For Missing Data
3	Studies On Missing Data Imputation
4	Random Forest Missing Data Algorithms In Big Data
5	Research On Passenger Transport Data Quality Detection And Missing Data Imputation
6	Missing Data Imputation Using Boosting Generative Adversarial Nets
7	Several Research On Random Forest Improvement
8	Research On Data Imputation Methods Oriented Specific Domains
9	The Analysis And Improvement Research Of Knn-imputation Algorithm
10	Attribute Associated Neuron Modeling And Missing Value Imputation Based On Neural Network