| Many real-world data sets contain missing values, which complicates analysis because many analytical methods require complete data. An efficient and practical way of handling missing data is therefore needed. A review of the domestic and international literature shows that most existing missing-data methods are suited only to small, low-dimensional data sets. They perform poorly on genomic, proteomic, neuroimaging, and other large-scale data, and they are computationally expensive. With the rapid development of science and technology, research on big data has become particularly important, because analyzing massive data sets can yield more valuable information. However, most enterprise data are unstructured and contain many missing values, which makes big-data analysis very slow. Random forests can handle high-dimensional data and are well suited to missing data of mixed types. Building on these properties, this paper proposes an improved method for efficiently handling missing data in big-data environments. Variables are divided into groups, each group serves as the dependent variable of a multiple-response regression, and the forest is grown using multivariate splits, which improves computation speed while preserving imputation accuracy. To verify the feasibility and adaptability of the algorithm, 40 data sets were selected from the UCI repository and a genome database, and the algorithm was compared with existing random forest imputation algorithms and with the mainstream KNN and EM algorithms. The performance of the imputation algorithms is evaluated under different missing-data mechanisms, and the effect of data correlation on imputation accuracy is analyzed. The experiments show that the random forest imputation algorithm is generally robust and that its accuracy improves as data correlation increases. It performs particularly well when data are missing not at random and the missing rate is medium to high. |