
Comparisons Between Three Methods Of Variable Selection For High-dimensional Data Under Missing Data

Posted on: 2019-03-14
Degree: Master
Type: Thesis
Country: China
Candidate: R Cui
Full Text: PDF
GTID: 2429330545953114
Subject: Applied statistics
Abstract/Summary:
The continuous development of information technology has greatly advanced data acquisition, and the demand for statistical analysis of high-dimensional data has become increasingly prominent. For high-dimensional data, however, and especially for the typical "large p, small n" problem, the effectiveness of traditional statistical methods is challenged. High-dimensional modeling and the corresponding model selection have therefore become a hot topic, but another difficulty of high-dimensional data, missing data, should not be ignored. There are many mature methods for model selection and variable selection in high-dimensional data. However, these methods assume complete data and do not account for missing values. Therefore, when screening variables in high-dimensional data with missing values, we adopt the strategy of "impute first, then screen" (complete-case analysis would discard a large number of samples). Many imputation methods have been proposed, and although the classical ones have good statistical properties, they are not well suited to practical data analysis. Traditional approaches classify the missingness mechanism as missing at random (MAR), missing completely at random (MCAR) or missing not at random (MNAR), estimate the missing values by nonparametric, semiparametric and Bayesian methods, and then fill in the missing data by sampling and interpolation. This thesis introduces an imputation method from the field of machine learning, low-rank matrix completion, which rests on a different idea from the classical methods and has a wider range of application. In addition, with recent algorithms it is also faster to compute. This thesis also introduces the more established MissGLasso model for variable screening with multivariate missing data; building on this model, Nicolas Städler and Peter Bühlmann (2011) proposed two methods: the MissGLasso imputation method and the MissGLasso2stage method. All three missing-data methods can readily handle data in which multiple variables have missing values. This thesis compares these newer methods with the classical missing-data methods in terms of theory and underlying ideas, compares the advantages and disadvantages of the three methods in simulation experiments and empirical studies, and examines the causes of their shortcomings. The three methods are then applied in an empirical study: gene selection in microarray data on vitamin B2 (riboflavin) production in Bacillus subtilis is used as an example to test how each method performs in practice. Finally, the thesis presents some shortcomings of the current research and some ideas for future work.
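The abstract does not state which matrix completion algorithm the thesis uses. As an illustration only, the sketch below fills missing entries with a Soft-Impute-style iteration (repeated soft-thresholding of singular values), one common way to carry out low-rank matrix completion. The function name `soft_impute`, its parameters `lam`, `max_iter` and `tol`, and the synthetic data are all assumptions made for this example, not details taken from the thesis.

```python
import numpy as np


def soft_impute(X, lam=1.0, max_iter=200, tol=1e-6):
    """Fill NaN entries of X using iterative soft-thresholded SVD.

    Hypothetical helper for illustration; `lam`, `max_iter` and `tol`
    are assumed tuning parameters, not values from the thesis.
    """
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    Z = np.where(observed, X, 0.0)  # start with zeros in the missing cells
    for _ in range(max_iter):
        # Shrink the singular values of the current filled matrix.
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        low_rank = (U * np.maximum(s - lam, 0.0)) @ Vt
        # Keep observed entries; use the low-rank estimate elsewhere.
        Z_new = np.where(observed, X, low_rank)
        change = np.linalg.norm(Z_new - Z) / max(np.linalg.norm(Z), 1e-12)
        Z = Z_new
        if change < tol:
            break
    return Z


# Synthetic "large p, small n" data (assumed): 30 samples, 100 variables, rank 3.
rng = np.random.default_rng(0)
M = rng.normal(size=(30, 3)) @ rng.normal(size=(3, 100))
missing = rng.random(M.shape) < 0.2  # each entry missing with probability 0.2 (MCAR)
X = np.where(missing, np.nan, M)

X_filled = soft_impute(X, lam=1.0)
rmse = np.sqrt(np.mean((X_filled[missing] - M[missing]) ** 2))
print(f"RMSE on the imputed entries: {rmse:.3f}")
```

Under the "impute first, then screen" strategy, a completed matrix such as `X_filled` would then be passed to a variable selection method designed for complete data.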
Keywords/Search Tags: missing data, high-dimensional data, variable selection, low-rank matrix completion, MissGLasso