
Generalized Propensity Score Method For High-dimensional Covariates And Its Application

Posted on: 2022-09-24
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Q Gao
GTID: 1484306518474294
Subject: Epidemiology and Health Statistics
Abstract/Summary:
Objective: The propensity score (PS) approach is a promising way to estimate causal effects in observational or non-randomized controlled studies by controlling for confounding bias. To identify an unbiased causal effect of an exposure/treatment on an outcome, one assumption the observed data must satisfy is that there are no unmeasured confounders. This assumption is untestable, and one generally needs to account for as many covariates as possible to make it plausible. In the "big data" era, a growing number of pretreatment covariates can be collected and analyzed, which induces a high-dimensional problem. For binary treatments, a variety of methods have been developed to address this problem, giving researchers diverse choices in practice. However, each method achieves unbiased estimation of causal effects through a different mechanism, and observed high-dimensional data are relatively complicated in practice. Choosing the most suitable statistical method for the observed data is therefore a new challenge for researchers. Before practical application, it is necessary to clarify the statistical properties of each method in different high-dimensional settings and the conditions under which it applies. In practice, the exposures/treatments of interest are often continuous variables. In this case, the parameter of interest is the dose-response function (DRF) of the continuous treatment on the outcome. If one simply dichotomizes the continuous treatment and then uses a typical PS method to estimate the treatment effect, the consistency assumption is violated, which causes estimation bias and may even lead to a conclusion contrary to the facts. To solve this problem, researchers have proposed the generalized propensity score (GPS). Similarly, to identify the DRF, the observed data need to satisfy the unconfoundedness assumption, which can only be made plausible by considering as many pre-treatment covariates as possible. Therefore, in
practice, researchers often encounter high-dimensional covariates. How to use the GPS to estimate the DRF in this setting is the second aim of this work.

Methods: For binary treatments, we first review propensity score methods in high-dimensional settings from the perspectives of variable/model selection and covariate balancing. Eight methods were then selected for simulation studies after considering the computational cost and large-sample properties of each method. They comprise post-double-selection (DS), the Double-Index Propensity Score (DiPS) and CTMLE-LASSO (Collaborative Targeted Maximum Likelihood Estimation), which directly use existing penalty functions; the outcome-adaptive LASSO (OAL) and GLiDeR (Group Lasso and Doubly Robust Estimation), which modify existing penalty functions to perform variable selection for causal inference; and the balance-based methods balanceHD (approximate residual balancing), RCAL (regularized calibrated estimation) and hdCBPS (high-dimensional covariate balancing propensity score). The simulation studies explored how different correlation structures among covariates (independent, moderate and strong correlation) and different n/p ratios affect the performance of each method, both when the models are correctly specified and when they are misspecified. In addition, when the models were correctly specified, the performance of each method under different confounding strengths was analyzed. Model misspecification covers misspecification of the outcome model, the treatment model, or both. Performance was assessed in terms of the ability to accurately identify confounders and prognostic covariates, covariate balance, and the accuracy, precision and effectiveness of the causal effect estimate. OAL is a variable selection method designed for causal inference rather than prediction. Theoretical and simulation studies also show that OAL can accurately identify confounders and prognostic predictors under certain
conditions. At the same time, the nonparametric covariate balancing generalized propensity score (npCBGPS) method depends less on correct specification of the GPS model and on the distribution of the error term. Based on these facts and the simulation results above, we combined npCBGPS and OAL by constructing a dual weighted correlation coefficient and proposed a generalized outcome-adaptive LASSO method (GOAL) to achieve unbiased estimation of the DRF in high-dimensional settings. Simulation studies were conducted to explore the finite-sample properties of GOAL. The simulation scenarios and parameters were the same as in the binary-treatment simulations above, except that the treatment was continuous. Balance was evaluated using the mean absolute value of the weighted correlation coefficient. To facilitate interpretation of the simulation results and to explore how the covariate set included in the GPS model affects the performance of npCBGPS, GOAL was compared with five npCBGPS weighting methods containing different covariate sets, among which the Targ weighting method, which contains only confounders and prognostic covariates, served as the reference. Finally, the DRF of accelerated epigenetic aging on Alzheimer's disease (AD) status was estimated to illustrate the application of GOAL.

Results: For binary treatments, the simulation studies showed that: (1) The GLiDeR method was the most stable. In all simulation scenarios, its variable selection was close to ideal, its estimates had small bias and high precision, and the coverage based on the estimated standard error was close to 95%; (2) The DS method always selected covariates that predict only treatment with a probability close to 1. When covariates were highly correlated, the frequency of selecting covariates that predict neither treatment nor outcome increased, which reduced accuracy and precision, especially when n/p
decreased; (3) The OAL method accurately identified confounders and prognostic covariates. However, as the correlation between covariates increased, it also selected more covariates that predict neither treatment nor outcome and more covariates that predict only treatment, which reduced estimation accuracy and precision. This problem can be mitigated by appropriately adjusting ? convergence; (4) DiPS and CTMLE-LASSO had large biases in the treatment effect estimates under all simulated scenarios; (5) hdCBPS was relatively stable in terms of estimation accuracy, with small bias. Its estimation precision decreased slightly as the correlation between covariates increased. However, the estimated variance of hdCBPS was affected by the correlation between covariates and by n/p: when the covariates were strongly correlated or n/p was small, the variance estimate was very large or even infinite; (6) The balanceHD method always performed best in terms of balance. When n/p was large, the accuracy and precision of its treatment effect estimate were comparable to those of hdCBPS, and its estimated variance was closer to the empirical variance. However, when n/p decreased, the bias increased markedly; (7) The RCAL-AIPW method clearly improved on the estimation accuracy of RCAL-IPW, but its performance was still inferior to that of balanceHD and hdCBPS.

For continuous treatments, when the treatment and outcome models were correctly specified, the Targ method was always the best or close to the best, compared with the other four npCBGPS weighting methods, in terms of accuracy, precision and coverage of the treatment effect estimate. For the GOAL method, variable selection was close to ideal, so its balancing performance and the accuracy, precision, robust standard error and empirical standard error of its treatment effect estimate were close to those of the Targ weighting method. In addition, when n/p was large, the variance estimated by the bootstrap method was closer to the empirical variance than the
robust variance, so its coverage was closer to the nominal level. When n/p was small, the variance estimated by the bootstrap method was substantially larger than the robust variance. When the outcome model was correctly specified and the treatment model was misspecified, the Targ weighting method estimated the treatment effect with small bias and high precision, but the variance was underestimated. The variable selection ability of GOAL decreased slightly, but the bias and precision of its estimated treatment effect remained close to those of the Targ weighting method. When the outcome model was misspecified, the accuracy and precision of the Targ weighting method were markedly worse. Finally, the analysis of multiple data sets from multiple brain regions found no statistically significant dose-response relationship between accelerated epigenetic aging and AD status.

Conclusion: When the treatment is binary and there are potentially high-dimensional pre-treatment covariates, the following suggestions are made based on the simulation results: (1) The GLiDeR method is preferred for unbiased estimation of causal effects; (2) One can appropriately adjust ?
convergence when using the OAL method, especially when the covariates are strongly correlated; (3) When the covariates are strongly correlated or n/p is small, the DS method is not recommended as a first choice; (4) Consider using the robust variance with the hdCBPS method; (5) The balanceHD method is recommended when n/p is large; (6) The DiPS, CTMLE-LASSO, RCAL-IPW and RCAL-AIPW methods are not recommended.

Regardless of whether the outcome model and the treatment model are correctly specified, the performance of GOAL is always close to that of the Targ weighting method. When the outcome model is correctly specified, GOAL estimates the DRF with small bias and high precision. When n/p is large, the bootstrap method is recommended for variance estimation; when n/p is small, the robust variance is recommended. When the outcome model is misspecified, npCBGPS performs poorly even if it includes all confounders and prognostic covariates. This implies that npCBGPS is robust to model misspecification only when the outcome model is correctly specified. In terms of balance, when the outcome model is correctly specified, the weighted correlation coefficient performs well, but better balance does not necessarily mean better accuracy and precision of the treatment effect estimate. Therefore, in practice, the quality of GPS or IPW weight estimation cannot be judged by balance alone. When the outcome model is misspecified, the weighted correlation coefficient cannot fully assess balance, and new comprehensive metrics are needed to evaluate it. Based on the simulation studies and application above, the analysis process using GOAL can be summarized as follows. First, the high-dimensional covariates are initially reduced to a moderate dimension (p <= n). Second, GOAL with default tuning parameters is used for variable selection and dose-response function estimation. Third, balance is evaluated. If the covariates
are not independent and the balance of the weighted sample is significantly improved, the analysis results can be used directly. If the balance of the covariates cannot be improved by weighting, one can adjust the ? convergence until a satisfactory balance is achieved. The adjustment is based on the effect of each covariate on the outcome after adjusting for treatment and on the correlation between covariates.
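
The balance metric described in the Methods (the mean absolute value of the weighted correlation coefficient between the treatment and each covariate) can be sketched in Python. This is a minimal illustration, not the dissertation's implementation: the function names `weighted_corr` and `mean_abs_weighted_corr`, the simulated data, and the stabilized weights computed from a known normal treatment model (standing in for npCBGPS weights, which are not implemented here) are all assumptions made for the example.

```python
import numpy as np


def weighted_corr(t, x, w):
    """Weighted Pearson correlation between treatment t and one covariate x."""
    t, x, w = (np.asarray(a, dtype=float) for a in (t, x, w))
    w = w / w.sum()                    # normalize weights to sum to 1
    t_c = t - np.sum(w * t)            # center at the weighted means
    x_c = x - np.sum(w * x)
    cov = np.sum(w * t_c * x_c)
    return cov / np.sqrt(np.sum(w * t_c**2) * np.sum(w * x_c**2))


def mean_abs_weighted_corr(t, X, w):
    """Average of |weighted correlation| over the columns of X."""
    X = np.asarray(X, dtype=float)
    return float(np.mean([abs(weighted_corr(t, X[:, j], w))
                          for j in range(X.shape[1])]))


def normal_pdf(z, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return np.exp(-(z - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)


# Hypothetical simulation: only the first covariate drives the treatment.
rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 3))
t = 0.5 * X[:, 0] + rng.normal(size=n)   # marginal variance of t is 1.25

unit_w = np.ones(n)
# Assumed stand-in for estimated weights: stabilized weights f(t) / f(t | x)
# computed from the true (known) simulation model.
stab_w = normal_pdf(t, 0.0, 1.25) / normal_pdf(t, 0.5 * X[:, 0], 1.0)

print("unweighted balance:", mean_abs_weighted_corr(t, X, unit_w))
print("weighted balance:  ", mean_abs_weighted_corr(t, X, stab_w))
```

A value near zero indicates that, in the weighted sample, the treatment is approximately uncorrelated with the covariates. As the conclusions above note, a small value alone does not guarantee an accurate or precise treatment effect estimate.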
Keywords/Search Tags: Propensity score, Generalized Propensity Score, high-dimensional pre-treatment covariates, causal inference, penalty