Research On Adaptive And Robust Missing Value Imputation Algorithm

Posted on: 2022-06-04
Degree: Master
Type: Thesis
Country: China
Candidate: W L Dong
Full Text: PDF
GTID: 2518306557975329
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, machine learning and data mining have become active research fields. With the rapid development of internet technologies, researchers face ever-growing volumes of data. However, many machine learning and data mining algorithms require complete data sets, which creates practical difficulties for users who encounter incomplete data. Many imputation algorithms already exist for filling in missing data. However, most of them are designed for static data and ignore data that arrives as a stream; moreover, most of them rely on a single model, so their robustness is generally poor. How to extend traditional imputation algorithms to online dynamic data streams, and how to improve their robustness, have therefore become important issues.

In this thesis, adaptive and robust missing value imputation algorithms are studied and developed. To address the adaptability problem of traditional imputation algorithms, two strategies based on a sliding time window are proposed to alleviate the imputation error caused by concept drift: an ordinary average strategy and a log-weighted average strategy, i.e., one that gradually increases the weight of instances along the time axis. The proposed strategies are combined with three imputation algorithms: mean imputation (MI), KNN imputation (KNNI), and Bayesian principal component analysis imputation (BPCAI). The experimental results indicate that the effectiveness of the strategies is independent of the specific imputation technique.

To improve the robustness of traditional imputation algorithms, the idea of ensemble learning is adopted and verified on gene expression data. First, the Pearson correlation coefficient is used to construct the correlation space, which is then divided into multiple random subspaces. Next, an extreme learning machine (ELM) regression model is trained on each random subspace. Finally, the mean of the outputs of all models is taken as the imputed value for the missing entry. The results show that the proposed schemes improve the adaptability and robustness of missing value imputation algorithms.
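
The two sliding-window strategies described in the abstract can be illustrated with a minimal Python sketch for the mean imputation (MI) case. The abstract does not give the exact weighting formula, so this sketch assumes weights proportional to log(1 + i), where i = 1 is the oldest instance in the window and larger i means more recent. The function names, the use of `None` for a missing value, and the choice to keep only observed values in the window are also illustrative assumptions, not the thesis's implementation.

```python
import math
from collections import deque


def log_weights(n):
    """Normalized weights that grow logarithmically with position.

    Position 1 is the oldest instance in the window, position n the newest.
    ASSUMPTION: the form log(1 + i) is chosen for illustration only.
    """
    raw = [math.log(1 + i) for i in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def windowed_mean_impute(stream, window_size, weighted=True):
    """Fill missing values (None) in a 1-D stream from a sliding window.

    The window holds the most recent `window_size` observed values.
    weighted=False -> ordinary average of the window.
    weighted=True  -> log-weighted average favouring recent values.
    """
    window = deque(maxlen=window_size)  # oldest value on the left
    filled = []
    for x in stream:
        if x is not None:
            window.append(x)  # only observed values enter the window
            filled.append(x)
            continue
        if not window:
            filled.append(None)  # nothing observed yet, cannot impute
            continue
        values = list(window)
        if weighted:
            weights = log_weights(len(values))
            estimate = sum(w * v for w, v in zip(weights, values))
        else:
            estimate = sum(values) / len(values)
        filled.append(estimate)
    return filled


if __name__ == "__main__":
    # A stream whose level shifts from 0 to 10 (a simple concept drift).
    data = [0.0, 0.0, 10.0, None]
    print(windowed_mean_impute(data, window_size=3, weighted=False))
    print(windowed_mean_impute(data, window_size=3, weighted=True))
```

On a stream whose level shifts upward, the log-weighted estimate lies closer to the most recent values than the ordinary average does, which is the intuition behind using it against concept drift. The same window could in principle feed KNNI or BPCAI instead of a mean, but that is not shown here.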
Keywords/Search Tags: Missing value imputation, Data stream, Sliding time window, ELM, Ensemble learning