
Research on Variable Screening Methods for Ultrahigh-Dimensional Data under Complete Data

Posted on: 2024-01-24
Degree: Master
Type: Thesis
Country: China
Candidate: Z Z Wang
Full Text: PDF
GTID: 2530307139457004
Subject: Statistics
Abstract/Summary:
Ultrahigh-dimensional data are common in genetics, biology, and genomics. Variable selection methods designed for low- and high-dimensional data run into three problems when applied to ultrahigh-dimensional data: computational complexity, statistical accuracy, and algorithmic stability. Numerous variable screening methods for ultrahigh-dimensional data have therefore been proposed. Building on two typical approaches to screening ultrahigh-dimensional discrete data, namely the information gain ratio and Gini impurity, this thesis discusses the limitations of the existing methods and proposes one model-free univariate screening method and two group variable screening methods for ultrahigh-dimensional data sets under the complete-data assumption. Numerical simulations and empirical studies show that the proposed methods accurately identify important variables and give more reliable and feasible results than the original methods.

First, a univariate screening method based on Gini impurity (PG-SIS) is proposed for discrete covariates in a model-free setting. The adjusted purity gain index serves as the screening statistic that measures the importance of each covariate. PG-SIS is shown theoretically to have feature screening consistency under regularity conditions, and numerical simulations and classification results on glioma data show that it effectively identifies covariates with high predictive power.

Second, a model-free group variable screening method based on Gini impurity (GPG-SIS) is proposed. It extends the univariate Gini impurity method to group variables: the purity gain based on Gini impurity is modified into a screening statistic for discrete group covariates. GPG-SIS is shown theoretically to have feature screening consistency. Numerical simulations and an analysis of lung cancer data show that GPG-SIS is more robust than univariate screening methods.

Finally, a group variable screening method (GIGR-SIS) that uses the information gain ratio as the screening statistic is constructed for complete data. The information gain ratio is the information gain multiplied by an adjustment factor; it represents the amount of information shared between the group variable and the response variable, and thus indicates the importance of the group variable among the covariates. Under certain regularity conditions, GIGR-SIS is shown theoretically to have feature screening consistency. Numerical simulations and an analysis of p53 cell line data show that GIGR-SIS screens more robustly than existing group variable screening methods.
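The abstract does not give the exact form of the adjusted purity gain index or of the adjustment factor in the information gain ratio, so the following is only a minimal sketch of marginal screening with the textbook versions of both statistics for a single discrete covariate. The function names (`purity_gain`, `info_gain_ratio`, `screen`), the choice of 1/H(X) as the adjustment factor, and the toy data are assumptions made for illustration, not the thesis's definitions; the group versions (GPG-SIS, GIGR-SIS) are not sketched.

```python
import numpy as np

def gini(y):
    """Gini impurity 1 - sum_r p_r^2 of a discrete vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(y):
    """Shannon entropy (natural log) of a discrete vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def conditional(impurity, x, y):
    """Weighted impurity of y within each level of x."""
    levels, counts = np.unique(x, return_counts=True)
    weights = counts / counts.sum()
    return sum(w * impurity(y[x == lv]) for lv, w in zip(levels, weights))

def purity_gain(x, y):
    """Textbook purity gain: Gini(y) minus the conditional Gini given x.
    (The thesis's 'adjusted' index may differ; this is an assumption.)"""
    x, y = np.asarray(x), np.asarray(y)
    return gini(y) - conditional(gini, x, y)

def info_gain_ratio(x, y):
    """Information gain times an adjustment factor, assumed here to be 1/H(x)."""
    x, y = np.asarray(x), np.asarray(y)
    h_x = entropy(x)
    if h_x == 0:          # constant covariate carries no information
        return 0.0
    return (entropy(y) - conditional(entropy, x, y)) / h_x

def screen(X, y, stat, d):
    """Rank columns of X by a marginal statistic and keep the top d indices."""
    scores = np.array([stat(X[:, k], y) for k in range(X.shape[1])])
    top = np.argsort(scores)[::-1][:d]
    return top, scores

# Toy data (hypothetical): 200 three-level covariates, response driven by columns 0 and 1.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(400, 200))
y = (X[:, 0] + X[:, 1] >= 3).astype(int)

top_pg, _ = screen(X, y, purity_gain, d=5)
top_igr, _ = screen(X, y, info_gain_ratio, d=5)
print("purity gain keeps:", sorted(top_pg.tolist()))
print("gain ratio keeps: ", sorted(top_igr.tolist()))
```

In this toy setting both statistics should rank the two active columns first. Dividing by H(x) is one common way to keep covariates with many levels from being favored, which is presumably the role the adjustment factor plays in the thesis.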
Keywords/Search Tags:Ultrahigh-dimensional data, model-free, group variable screening, Gini impurity, information gain ratio