Random Forest Missing Data Algorithms In Big Data

Posted on:2019-03-08

Degree:Master

Type:Thesis

Country:China

Candidate:Z Y Yu

Full Text:PDF

GTID:2428330545967538

Subject:Computer Science and Technology

Abstract/Summary:

Many real-world data sets contain missing values, and this causes considerable trouble for data analysis, because many analysis procedures require a complete data set. An efficient and feasible way to handle missing data is therefore needed. A review of the domestic and international literature shows that most existing methods for handling missing data are suited only to small, low-dimensional data sets. They perform poorly on genomic, proteomic, neuroimaging, and other large-scale data, and they take a long time to compute. At the same time, the rapid development of science and technology has made big data research increasingly important, since analyzing massive data can yield more valuable information. Most enterprise data, however, are unstructured and still contain many missing values, so analysis of big data remains very slow.

A random forest can handle high-dimensional data and is well suited to missing data of mixed types. Building on these properties, this thesis proposes an improved method for efficiently handling missing data in a big data environment. The variables are divided into groups, each group is treated as the dependent variable of a multiple-response regression, and the forest is built through repeated multivariate splits, which raises computation speed while preserving imputation accuracy. To verify the feasibility and adaptability of the algorithm, 40 different data sets are selected from the UCI repository and genome databases, and the method is compared with existing random forest imputation algorithms and with the mainstream KNN and EM algorithms. The performance of the imputation algorithms is evaluated under different missing-data mechanisms, and the effect of data correlation on imputation accuracy is analyzed. The experiments show that random forest imputation is robust in general and that its accuracy improves as data correlation increases. It performs especially well when data are missing not at random and at medium to high missing rates.
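
The evaluation described above (hide known values under a chosen missing-data mechanism, impute them, then measure the error on the hidden cells) can be sketched as follows. This is a minimal illustration, not the thesis's code: the function names, the synthetic Gaussian data, the 30% missing rate, and the rule used to simulate data that are missing not at random are all assumptions, and plain column-mean imputation stands in for the random forest, KNN, and EM imputers that the thesis compares.

```python
import math
import random


def mask_mcar(rows, rate, rng):
    """Missing completely at random: every cell is hidden with the same probability."""
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def mask_mnar(rows, rate, rng):
    """Missing not at random (assumed rule): larger values are more likely to be hidden."""
    masked = [list(row) for row in rows]
    for j in range(len(rows[0])):
        column = sorted(row[j] for row in rows)
        median = column[len(column) // 2]
        for row in masked:
            # Cells above the column median are hidden at 1.5x the rate, others at 0.5x.
            p = rate * (1.5 if row[j] > median else 0.5)
            if rng.random() < p:
                row[j] = None
    return masked


def mean_impute(masked):
    """Baseline imputer: fill each hidden cell with the mean of the observed column values."""
    means = []
    for j in range(len(masked[0])):
        observed = [row[j] for row in masked if row[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in masked]


def rmse_on_missing(truth, masked, imputed):
    """Root mean squared error, computed only over the cells that were hidden."""
    squared = [
        (t - i) ** 2
        for t_row, m_row, i_row in zip(truth, masked, imputed)
        for t, m, i in zip(t_row, m_row, i_row)
        if m is None
    ]
    return math.sqrt(sum(squared) / len(squared)) if squared else 0.0


rng = random.Random(0)
# Synthetic complete data: 500 rows, 4 independent standard-normal columns.
truth = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(500)]
for name, mask in (("MCAR", mask_mcar), ("MNAR", mask_mnar)):
    masked = mask(truth, 0.3, rng)
    imputed = mean_impute(masked)
    print(name, round(rmse_on_missing(truth, masked, imputed), 3))
```

Swapping `mean_impute` for another imputer with the same input and output shape would let the same harness compare methods across mechanisms and missing rates, which is the kind of comparison the abstract reports.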

Keywords/Search Tags:

missing data, big data, random forest, data imputation

Related items

1	Comparative Study On Imputation Methods Of Missing Data In XGBOOST Model Under Complete Random Missing Mechanism
2	Random Forest Missing Data Algorithms In Big Data
3	Studies On Missing Data Imputation
4	Nonparametric Imputation For Missing Data
5	Missing Data Imputation Using Boosting Generative Adversarial Nets
6	Research On Passenger Transport Data Quality Detection And Missing Data Imputation
7	Research On Key Technologies Of Missing Data Imputation In Wireless Sensor Networks
8	Research On Missing Data Imputation Based On Tensor Decomposition
9	Research And Implementation Of Imputation Method For Missing Data In The Trash Pickup Logistics Management System
10	Research On Data Imputation Methods Of Mixed Missing Type