| Because data is the carrier of information, data quality strongly affects how well that information is expressed, and data integrity is one of the most important criteria in evaluating data quality. In practice, however, data often becomes incomplete during collection and storage for various reasons, which affects subsequent research. Existing data mining methods mostly discover knowledge from complete data, so as the missing rate rises, the accuracy of the mined knowledge suffers badly. To reduce the negative impact of missing data on data mining, we preprocess the data by constructing predictive models based on random forest to impute the missing values. Random forest, with its excellent classification performance, is an effective method for estimating missing values because it automatically measures the proximity between instances. However, the similarity measure that random forest currently uses between instances in missing data imputation is neither comprehensive nor accurate enough. Therefore, firstly, to address the problem that random forest computes the proximity between instances with a "rough binary" standard, we present an improved similarity metric that takes into account the influence of the distance between decision tree nodes on proximity. Meanwhile, to guarantee the robustness of the proposed algorithm and make the most of the original data, we improve the imputation strategy of random forest by considering the information in incomplete data and introducing the idea of the kNN algorithm. 
Experiments on five datasets show that the combination of our improved proximity metric and imputation strategy estimates missing values more accurately than the original approach, which verifies the effectiveness of the proposed methods. Secondly, using median/mode imputation to initialize random forest imputation changes the original data distribution to some extent, which affects both the accuracy of the final imputation and the convergence rate. Thus, we incorporate prior knowledge of the dataset into the random forest imputation mechanism: instead of median/mode imputation, naive Bayes supplies the initial guess used to form a temporary complete dataset. Experiments on four UCI datasets show that the proposed method improves both missing value prediction and the convergence rate. |
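
The abstract does not spell out the improved metric, so the following is only a minimal sketch of the baseline it sets out to refine: the standard "binary" proximity (two instances either share a leaf in a tree or they do not), combined with a kNN-style restriction to the most proximate donors. The function names, the toy leaf assignments, and the parameter `k` are all hypothetical and chosen for illustration. In practice the leaf indices could come from a fitted forest, for example via scikit-learn's `apply` method, but that is an assumption and not something the source states.

```python
from typing import List, Optional, Sequence


def proximity(leaves_a: Sequence[int], leaves_b: Sequence[int]) -> float:
    """Binary proximity: fraction of trees in which both instances share a leaf."""
    same = sum(1 for a, b in zip(leaves_a, leaves_b) if a == b)
    return same / len(leaves_a)


def impute_numeric(
    target: int,
    leaves: List[Sequence[int]],
    values: List[Optional[float]],
    k: int = 3,
) -> Optional[float]:
    """Proximity-weighted mean over the k most proximate donors.

    Only instances with an observed value can act as donors. Returns None
    when no donor has a positive proximity to the target.
    """
    donors = [
        (proximity(leaves[target], leaves[i]), values[i])
        for i in range(len(values))
        if i != target and values[i] is not None
    ]
    # Keep the k donors with the highest proximity (the kNN-style step).
    donors.sort(key=lambda d: d[0], reverse=True)
    top = [d for d in donors[:k] if d[0] > 0]
    total = sum(w for w, _ in top)
    if total == 0:
        return None
    return sum(w * v for w, v in top) / total


# Made-up leaf assignments: 4 instances (rows) across 4 trees (columns).
leaves = [
    [1, 2, 3, 4],
    [1, 2, 3, 9],
    [1, 7, 8, 9],
    [5, 6, 7, 8],
]
# Instance 0 has a missing value in the feature being imputed.
values = [None, 10.0, 20.0, 30.0]

estimate = impute_numeric(0, leaves, values, k=2)
print(estimate)
```

In this toy example, instance 0 shares three of four leaves with instance 1 and one of four with instance 2, so the estimate leans toward instance 1's value. A metric that also accounts for the distance between tree nodes, as the abstract proposes, would replace the all-or-nothing comparison inside `proximity`.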