Comparisons Between Three Methods Of Variable Selection For High-dimensional Data Under Missing Data

Posted on:2019-03-14

Degree:Master

Type:Thesis

Country:China

Candidate:R Cui

GTID:2429330545953114

Subject:Applied statistics

Abstract/Summary:

The continuous development of information technology has greatly advanced data acquisition, and the demand for statistical analysis of high-dimensional data has become increasingly prominent. However, for high-dimensional data, and especially for the typical "large p, small n" problem, the effectiveness of traditional statistical methods has been challenged. High-dimensional modeling and the corresponding model selection have therefore become an active research topic, but the other difficulty of high-dimensional data, missing data, should not be ignored. Many mature methods exist for model selection and variable selection in high-dimensional data. However, these methods assume complete data and do not account for missing values. When selecting variables from high-dimensional data with missing values, we therefore adopt an "impute first, then select" strategy, because complete-case analysis would discard a large number of samples.

A considerable number of imputation methods for missing data have been proposed, but although these classical methods have good statistical properties, they are not well suited to practical data analysis. Traditional methods classify the missingness mechanism as missing at random (MAR), missing completely at random (MCAR) or missing not at random (MNAR), estimate the missing values by nonparametric, semiparametric or Bayesian methods, and then fill in the missing data by sampling and interpolation. This thesis introduces an imputation method from the field of machine learning, low-rank matrix completion, which follows a different line of thinking from the classical methods and has a wider range of application. In addition, with the support of recent algorithms, it is faster to compute. This thesis also introduces the more mature MissGLasso model for variable selection with multivariate missing data. Based on the MissGLasso model, Nicolas Städler and Peter Bühlmann (2011) proposed two methods: the MissGLasso imputation method and the MissGLasso2stage method. All three of these methods can readily handle multivariate missing data.

This thesis compares these newer statistical methods with the classical approaches to missing data in terms of theory and underlying ideas, compares the advantages and disadvantages of the three methods in simulation experiments and an empirical study, and examines the reasons for their shortcomings. The three methods are then applied to a practical problem: gene selection in microarray data on vitamin B2 production in Bacillus subtilis is used as an example to test how each method performs in practice. The thesis closes with some shortcomings of the current research and some ideas for future work.
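As a rough illustration of the fill-then-screen strategy described above, the following Python sketch first completes a matrix containing missing entries with a low-rank approximation and then runs a Lasso on the completed matrix. It is not the thesis's implementation. The abstract does not say which completion algorithm or which selection procedure was used, so the soft-thresholded SVD iteration (in the style of SoftImpute) and the coordinate-descent Lasso are assumptions chosen for brevity. The function names `soft_impute` and `lasso_cd` and the parameters `lam` and `alpha` are hypothetical.

```python
import numpy as np


def soft_impute(X, lam=1.0, n_iter=200, tol=1e-6):
    """Fill NaN entries of X with a low-rank estimate.

    Assumed algorithm (not taken from the thesis): repeatedly fill the
    missing cells with the current estimate, take an SVD, and shrink the
    singular values by lam, which encourages a low-rank solution.
    """
    X = np.asarray(X, dtype=float)
    mask = ~np.isnan(X)                      # True where a value is observed
    Z = np.zeros_like(X)                     # current low-rank estimate
    for _ in range(n_iter):
        filled = np.where(mask, X, Z)        # observed values + current guesses
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        s = np.maximum(s - lam, 0.0)         # soft-threshold singular values
        Z_new = (U * s) @ Vt
        change = np.linalg.norm(Z_new - Z)
        Z = Z_new
        if change <= tol * max(np.linalg.norm(Z), 1.0):
            break
    # Observed entries are kept exactly; only the gaps are filled.
    return np.where(mask, X, Z)


def lasso_cd(X, y, alpha=0.1, n_iter=200):
    """Lasso by cyclic coordinate descent.

    Minimises (1 / (2n)) * ||y - X b||^2 + alpha * ||b||_1.
    Variables with a non-zero coefficient are the "selected" ones.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.copy()                         # y - X @ beta, with beta = 0
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue                     # skip all-zero columns
            resid += X[:, j] * beta[j]       # partial residual without column j
            rho = X[:, j] @ resid / n
            beta[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_sq[j]
            resid -= X[:, j] * beta[j]
    return beta
```

A typical call would be `X_filled = soft_impute(X_missing, lam=1.0)` followed by `selected = np.flatnonzero(lasso_cd(X_filled, y, alpha=0.1))`. In practice `lam` and `alpha` would be tuned, for example by cross-validation, and the columns would usually be standardized before selection.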

Keywords/Search Tags:

Missing data, high-dimensional data, variable selection, low rank matrix completion, MissGLasso

Related items

1	The Study Of Imputation Methods For Missing Values Based On LASSO In Compositional Data
2	Variable Selection Methods Based On Penalized Likelihood Function And Their Applications In High-dimensional Model
3	Study On Variable Selection In Expectile Regression And Optimal Portfolio Selection
4	Determination Methods Of Disaster Assessment Variables And Weights Under Missing Data
5	Data Processing Theories And Methodologies In Pair-Wise Comparison Decision Matrices
6	The Missing Data's Effect On Micro Econometrics
7	Theoretical Research And Empirical Analysis Of High-dimensional Integral Volatility Matri
8	Study On Product CTQ Identification Based On Feature Selection
9	High-dimensional Data-driven Credit Risk Evaluation Of Online Loan
10	Self-starting Statistical Control Charts For High-dimensional Process Mean And Covariance Matrix