
Variable Screening Methods For Ultra-high Dimensional Categorical Covariates

Posted on: 2020-03-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L Ni
Full Text: PDF
GTID: 1360330596467926
Subject: Statistics
Abstract/Summary:
As a dimension reduction technique, variable screening plays a critical role in ultrahigh-dimensional data analysis and has been discussed widely in the literature over the past decade. Whether the response is continuous or categorical, most existing variable screening methods explicitly or implicitly assume that all covariates are continuous. Huang, Li & Wang (2014) [37] first proposed a Pearson Chi-square based feature screening method (PC-SIS) tailored to classification problems with ultrahigh-dimensional categorical covariates, a problem that is common in practice but has seldom been discussed in the literature. Neither the original screening statistic nor its p-value adjustment works well when the numbers of categories of the covariates are unequal.

The main work of this doctoral thesis is as follows.

We develop a novel model-free feature screening procedure for ultrahigh-dimensional categorical covariates in a classification problem. A particular feature is that the numbers of categories of the covariates are allowed to be not only different but also diverging. The number of categories of the response is allowed to diverge too. The screening index combines the information gain used in the decision tree algorithm ID3 with an adjustment factor defined as the reciprocal of the logarithm of the number of categories of each covariate. This screening method is denoted IG-SIS. Each screening statistic measures the correlation between the response and a given covariate, and thereby evaluates the predictive power of that covariate.

Theoretically and empirically, we improve the Pearson Chi-square feature screening method and the tuning parameter selection proposed by Huang, Li & Wang (2014) [37]. The improved screening statistic is defined as the original Pearson Chi-square screening statistic multiplied by the same adjustment factor used in IG-SIS. This method is called the adjusted Pearson Chi-square feature screening method (APC-SIS). We find that APC-SIS performs far better than PC-SIS when the numbers of categories of the covariates differ.

Missing covariate data are common in ultrahigh-dimensional data analysis. Developing screening methods for incomplete data is challenging because the traditional approaches to handling missing data cannot be applied directly in the ultrahigh-dimensional case. We provide a model-free approach to screening categorical covariates with ignorable missing values (IMC-SIS). The method applies when there are large numbers of covariates both with and without missing values, and when the missingness of a covariate value is related to the response and to a small proportion of the covariates that have no missing values. This missingness mechanism is missing at random. We propose a two-step variable screening method. For each covariate with missing data, the first step screens for the variables that enter the unspecified propensity function. In the second step, the joint probability of the covariate and the response is estimated by leveraging the variables determined in the first step and the special structure of categorical data. Given the joint probability estimates, we then define screening statistics to pick out the covariates with good predictive power.

In terms of theory, variable screening (selection) consistency is established for all the proposed variable screening methods. From a practical perspective, we examine all the variable screening methods in several simulation settings. The results indicate that (1) IG-SIS and APC-SIS perform similarly in finite samples, and both have advantages over other existing variable screening methods designed for categorical data; and (2) IMC-SIS successfully picks out covariates with good predictive power even with a larger missing proportion and higher correlation between the covariates. In addition, the proposed feature screening methods are applied to two datasets, one on customer credit levels and one on online recruitment, and the screening results are interpretable and beneficial for further analyses.
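The two adjusted screening indices described in the abstract can be illustrated with a minimal Python sketch. The abstract gives no formulas, so the exact forms here are assumptions: the IG-SIS index is taken as the sample information gain H(Y) - H(Y|X) divided by log J, where J is the number of observed categories of the covariate, and the APC-SIS index is taken as the Pearson Chi-square statistic divided by the sample size, multiplied by 1/log J. The function names (`adjusted_information_gain`, `adjusted_chi_square`, `screen`) are hypothetical and do not come from the thesis.

```python
import numpy as np


def _joint_probabilities(x, y):
    """Empirical joint distribution of two categorical vectors (rows: x, cols: y)."""
    x_codes = np.unique(np.asarray(x).ravel(), return_inverse=True)[1]
    y_codes = np.unique(np.asarray(y).ravel(), return_inverse=True)[1]
    table = np.zeros((x_codes.max() + 1, y_codes.max() + 1))
    np.add.at(table, (x_codes, y_codes), 1.0)  # count co-occurrences
    return table / table.sum()


def _entropy(probs):
    """Shannon entropy (natural log) of a probability array; zero cells are skipped."""
    p = probs[probs > 0]
    return float(-(p * np.log(p)).sum())


def adjusted_information_gain(x, y):
    """Assumed IG-SIS index: information gain divided by log(number of categories)."""
    joint = _joint_probabilities(x, y)
    n_categories = joint.shape[0]
    if n_categories < 2:  # a constant covariate carries no information
        return 0.0
    h_y = _entropy(joint.sum(axis=0))
    # H(Y|X) = H(X, Y) - H(X)
    h_y_given_x = _entropy(joint.ravel()) - _entropy(joint.sum(axis=1))
    return (h_y - h_y_given_x) / np.log(n_categories)


def adjusted_chi_square(x, y):
    """Assumed APC-SIS index: (Chi-square statistic / n) times 1 / log(number of categories)."""
    joint = _joint_probabilities(x, y)
    n_categories = joint.shape[0]
    if n_categories < 2:
        return 0.0
    # Expected cell probabilities under independence; all margins are positive
    # because only observed categories are counted.
    expected = np.outer(joint.sum(axis=1), joint.sum(axis=0))
    statistic = float(((joint - expected) ** 2 / expected).sum())
    return statistic / np.log(n_categories)


def screen(X, y, d, index=adjusted_information_gain):
    """Return the column indices of the d highest-scoring covariates, plus all scores."""
    X = np.asarray(X)
    scores = np.array([index(X[:, k], y) for k in range(X.shape[1])])
    order = np.argsort(-scores, kind="stable")  # descending, ties keep column order
    return order[:d], scores
```

Calling `screen(X, y, d)` on an n-by-p array of category labels returns the d covariates ranked highest by the chosen index. Choosing d, and the tuning parameter selection mentioned in the abstract, are left outside this sketch.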
Keywords/Search Tags:ultrahigh-dimensional variable screening, number of categories, Chi-square statistic, entropy and information gain, covariate value missing at random