Research On Variable Screening Method Of Ultrahigh-Dimensional Under Complete Data

Posted on:2024-01-24

Degree:Master

Type:Thesis

Country:China

Candidate:Z Z Wang

Full Text:PDF

GTID:2530307139457004

Subject:Statistics

Abstract/Summary:

Ultrahigh-dimensional data is widely available in the fields of genetics, biology, and genomics. Variable selection methods designed for low- and high-dimensional data run into three problems when applied to ultrahigh-dimensional data: computational complexity, statistical accuracy, and algorithmic stability. Numerous variable selection methods for ultrahigh-dimensional data have therefore been proposed. Building on two typical approaches to screening ultrahigh-dimensional discrete data, namely the information gain ratio and Gini impurity, and after discussing the limitations of the existing methods, this thesis proposes a model-free univariate screening method and two group variable screening methods for ultrahigh-dimensional data sets under the complete-data assumption. Numerical simulations and empirical studies show that the proposed methods accurately identify important variables and that their results are more reliable and feasible than those of the original methods.

First, a univariate screening method based on Gini impurity (PG-SIS) is proposed for discrete covariates under model-free assumptions. The adjusted purity gain index is used as the screening statistic to measure the importance of each covariate. It is shown theoretically that PG-SIS has feature screening consistency under regularity conditions. Numerical simulations and classification results on glioma data also show that PG-SIS effectively identifies covariates with high predictive power.

Second, a model-free group variable screening method based on Gini impurity (GPG-SIS) is proposed. This method extends the univariate Gini-impurity-based screening method to group variables: it modifies the purity gain based on Gini impurity so that it can serve as a screening statistic for discrete group covariates. It is shown theoretically that GPG-SIS has feature screening consistency. Numerical simulations and an analysis of lung cancer data show that GPG-SIS is more robust than univariate screening methods.

Finally, a group variable screening method (GIGR-SIS) that uses the information gain ratio as the screening statistic is constructed for complete data. The information gain ratio, constructed as the information gain multiplied by an adjustment factor, represents the amount of information shared between a group variable and the response variable and thus indicates the importance of that group variable among the covariates. Under certain regularity conditions, GIGR-SIS is shown theoretically to have feature screening consistency. Numerical simulations and an analysis of p53 cell line data show that GIGR-SIS screens more robustly than existing group variable screening methods.
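The two screening statistics named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the adjustment applied to the purity gain, nor the exact adjustment factor in the information gain ratio, so the code below assumes the textbook forms (unadjusted Gini purity gain, and information gain divided by the split entropy of the covariate). The function names `purity_gain`, `information_gain_ratio`, and `screen` are hypothetical, and only univariate (marginal) screening is shown.

```python
import numpy as np


def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 of a vector of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def entropy(labels):
    """Shannon entropy (natural log) of a vector of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))


def purity_gain(x, y):
    """Drop in Gini impurity of y after splitting on the categories of x.

    Assumption: textbook, unadjusted purity gain; the thesis uses an
    adjusted index whose form is not stated in the abstract.
    """
    n = len(y)
    within = sum((np.sum(x == v) / n) * gini_impurity(y[x == v])
                 for v in np.unique(x))
    return gini_impurity(y) - within


def information_gain_ratio(x, y):
    """Information gain of x about y, divided by the entropy of x.

    Assumption: the classical gain ratio, i.e. the adjustment factor is
    taken to be 1 / H(x); the thesis may define it differently.
    """
    n = len(y)
    conditional = sum((np.sum(x == v) / n) * entropy(y[x == v])
                      for v in np.unique(x))
    split_entropy = entropy(x)
    if split_entropy == 0:  # a constant covariate carries no information
        return 0.0
    return (entropy(y) - conditional) / split_entropy


def screen(X, y, score, d):
    """Score every column of X marginally and keep the d highest-scoring."""
    scores = np.array([score(X[:, j], y) for j in range(X.shape[1])])
    keep = np.argsort(-scores)[:d]
    return keep, scores


if __name__ == "__main__":
    # Toy data: column 1 determines y, column 0 is independent of y.
    y = np.array([0, 0, 1, 1])
    X = np.column_stack([np.array([0, 1, 0, 1]), np.array([0, 0, 1, 1])])
    print(screen(X, y, purity_gain, d=1))
    print(screen(X, y, information_gain_ratio, d=1))
```

On the toy data, both statistics rank the second column (index 1) first, since it determines the response exactly while the first column is independent of it. In an ultrahigh-dimensional setting the same marginal scoring would be applied to each of the many covariates, with `d` chosen much smaller than the number of columns.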

Keywords/Search Tags:

Ultrahigh-dimensional data, model-free, group variable screening, Gini impurity, information gain ratio

Related items

1	Variable Screening For Statistical Models With Ultrahigh Dimensional Data
2	Variable Screening Methods For Ultra-high Dimensional Categorical Covariates
3	Gini-Index Based Feature Screening For Ultrahigh Dimensional Categorical Data
4	Model-Free Feature Screening With Exposure Variable
5	Research And Application Of Variable Method For Ultrahigh Dimensional Data
6	Non-marginal Variable Screening For Additive Hazards Model With Ultrahigh-dimensional Covariates
7	Research On Feature Screening Method For Ultrahigh Dimensional Discriminant Analysis Data
8	Model-free Conditional Feature Screening For Case Ⅱ Interval-censored Failure Time Data With Ultrahigh Dimensional Covariates
9	Some Studies On Feature Screening Of Ultra-high-dimensional Longitudinal Data And Group Structured Data
10	Variable Selection And Feature Screening In High-dimensional Data