Research And Application In Medical Studies Of Inverse Probability Weighting Based On Statistical Learning

Posted on: 2019-11-25
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W Guo
GTID: 1360330542992011
Subject: Epidemiology and Health Statistics
Abstract/Summary:
Background: Exploring the causal effects of treatment/exposure factors on outcomes is an important issue in medical research. Randomized trials are often considered the gold standard for estimating causal effects. In observational studies, subjects' treatment allocation is usually not random and is prone to confounding. When exposure effects are compared between treatment groups, the effect estimates will be biased if those confounders are ignored. Inverse probability weighting (IPW) of marginal structural models (MSM) is an important statistical method for estimating treatment effects in observational studies. Applying IPW requires certain assumptions: no unmeasured confounders, positivity, the stable unit treatment value assumption, and no misspecification of the weight estimation model. Weight estimation in the first stage is crucial because the final treatment effect estimates are very sensitive to the accuracy of the inverse probability weights. If the weight estimation model is misspecified, for example by omitting quadratic or interaction terms of covariates, the estimated weights are inaccurate and may include extreme values, leading to biased effect estimates. In recent years, a growing number of researchers have recommended data-adaptive methods, such as statistical learning algorithms, for estimating inverse probability weights and have achieved favorable results. However, current research focuses mostly on binary treatments and typical time-to-event data. Continuous treatments/exposures and time-dependent competing risk data are very common in medical practice. Inverse probability weighting for continuous treatments is complicated by issues not encountered in the binary treatment setting, including the need to identify a correct distributional form for the treatment and the need to deal with outliers that make highly variable weights more likely. For time-dependent competing risk data, the inverse probability weights in
marginal structural cause-specific hazard models (MSCSHM) are cumulative products over multiple follow-up time points, and even mild misspecification of the weight estimation model may yield severely biased effect estimates. Therefore, exploring the applicability of data-adaptive methods such as statistical learning algorithms to inverse probability weighting for continuous treatment data and time-dependent competing risk data has potential theoretical and practical value.

Objective: (1) For observational data with continuous treatments, we explore the performance of seven methods for constructing inverse probability weights in different scenarios via a series of Monte Carlo simulations. Moreover, taking the general linear model as an example, we investigate the influence of weight truncation on the effect estimates. (2) For time-dependent competing risk data, we introduce eight statistical learning methods to construct the inverse probability weights of the MSCSHM and examine the performance of these eight methods, as well as logistic regression, in different scenarios via a series of Monte Carlo simulations. Accordingly, the optimal weighting methods for the MSCSHM are determined. In addition, we investigate the properties of the estimated treatment effects under various levels of weight truncation.

Method: For each of the two research objectives, the study proceeded through data simulation, model building, model selection, and case application, as follows. (1) Research on inverse probability weighting for continuous treatments. The Monte Carlo method was used to simulate observational cohort data with continuous treatments. The simulation settings included three sample sizes (250, 1000 and 2500) and four treatment status generation models (linearity and additivity; non-linearity; non-additivity; non-linearity and non-additivity). Seven statistical methods including the general linear model (GLM), gamma regression model, quantile
binning (QB), covariate-balancing propensity score (CBPS), nonparametric CBPS (npCBPS), boosted classification and regression trees (boosted CART) and random forest (RF) were used to estimate generalized propensity scores (GPS) and inverse probability weights. In addition, the stabilized weights (SW) estimated by GLM were truncated at the two-sided 1% and 5% quantiles of the weight distribution to obtain the corresponding truncated weights. The original samples were then weighted by each of the nine estimated weight variables, and the respective treatment effect estimates were obtained through a weighted outcome regression model. We evaluated the performance of the weighting methods using the average absolute correlation coefficient (AACC), relative bias, standard deviation (SD), standard error (SE) of the effect estimate, root mean squared error (RMSE) and 95% confidence interval (CI) coverage. Finally, we compared the appropriateness of the different IPW methods by analyzing the effect of smoking on total medical expenditure.

(2) Research on inverse probability weighting for time-dependent competing risk data. First, the MSCSHM was employed as the basic analytical framework for time-dependent competing risk data. Eight statistical learning algorithms including LASSO, Bayesian logistic regression, CART, bagged CART, boosted CART, RF, support vector machine (SVM) and ensemble learner (EL) were introduced to construct the inverse probability weights in the model's first stage. The Monte Carlo method was then used to simulate time-dependent competing risk data. The simulation settings consisted of two sample sizes (250 and 1000), two levels of serial dependence of treatment history (autocorrelation coefficients of log(4) and 0.5, respectively), two numbers of outcome events (2 and 3), and four treatment status generation models (linearity and additivity; non-additivity; non-linearity; non-linearity and non-additivity). The above statistical learning methods as well as logistic
regression were used to estimate the stabilized weights. In addition, we truncated the stabilized weights at the two-sided 1%, 5%, 10%, 25%, 35%, and 50% quantiles of the weight distribution to investigate the properties of the estimated treatment effects. Bias, SD, SE, RMSE and 95% CI coverage were used to evaluate the performance of the different IPW methods and to determine the optimal method under each simulation scenario. Finally, we applied the optimal method to estimate the effect of a dynamic thiopurine regimen on the cause-specific hazards of (I) occurrences of cancer and death and (II) deaths free of cancer in patients with inflammatory bowel disease, in order to assess its appropriateness in real data analysis. Several traditional statistical methods were also used to analyze these cohort data, and the results of the different methods were compared. All analyses were implemented in the statistical software R, version 3.4.3.

Results: (1) Research on inverse probability weighting for continuous treatments. I. Simulation results: (a) In terms of covariate balance, CBPS performed best among all the investigated methods, followed by npCBPS. With untruncated stabilized weights estimated by GLM, the distribution of covariates in the weighted samples remained unbalanced. After the extreme weights were removed by weight truncation, the balance of measured covariates in samples weighted by GLM(1,99) improved significantly. (b) Boosted CART and RF were less biased when the weight estimation models were misspecified to varying degrees and were superior to the other methods in terms of bias reduction. (c) The SDs of GLM, GLM(1,99) and GLM(5,95) showed that weight truncation reduces the variance of the estimated treatment effect, and the variance decreased gradually as the truncation level increased. (d) The smaller RMSE of CBPS, npCBPS and boosted CART reflected higher precision of the treatment effect estimates. The RMSE of GLM(5,95) and GLM(1,99) was lower than that of the three methods
mentioned above because weight truncation reduced the variance. (e) As the complexity of the treatment status generation model increased, the 95% CI coverage of each method decreased to varying degrees. Among all the weighting methods, GLM(1,99), CBPS, npCBPS and boosted CART were comparatively robust to model misspecification. II. Case study results: By studying the effect of smoking on total medical expenditure, we present an analysis strategy for observational cohort data with continuous treatments using inverse probability weighting, that is, “Identification of the distribution of continuous treatments - Estimation of inverse probability weights - Examination of the distribution of stabilized weights - Assessment of the covariate balance in weighted samples - Estimation of the dose-response function.” The results showed that total medical expenditure rose overall as the amount of smoking increased; after controlling for confounders, the estimated effect of smoking on total medical expenditure decreased and its standard error increased. The effect estimated by the boosted CART(1,99) weighting method was marginally significant. However, the estimates obtained by the GLM(1,99) and RF(1,99) weighting methods still achieved statistical significance.

(2) Research on inverse probability weighting for time-dependent competing risk data. I. Simulation results: (a) When the treatment status generation models included only main effect terms (linearity and additivity), the bias produced by boosted CART was small, and its SDs and RMSE were the lowest. The SDs of the estimates produced by parametric logistic regression were larger, and thus the precision of its estimates was lower. (b) When the treatment status generation models included second-order interaction terms between covariates (non-additivity), the estimates generated by boosted CART and RF were very close in settings with both a larger sample size and strong serial dependence of treatment history, and
the properties of these estimates were superior to those of the other methods; RF achieved the lowest bias and RMSE in settings with smaller samples or modest serial dependence of treatment history. (c) When the treatment status generation models included nonlinear terms of covariates, boosted CART produced the lowest bias and RMSE and the highest 95% CI coverage. (d) When the treatment status generation models included both second-order interaction terms and nonlinear terms, RF yielded the lowest RMSE and the best 95% CI coverage. (e) Weight truncation at lower levels (e.g., the two-sided 1% quantiles) could further improve the estimates of both boosted CART and RF in terms of RMSE. (f) Within the same simulation setting, the optimal truncation level differed across most weight estimation methods; for the same method, the optimal truncation level also differed across simulation settings. The choice of optimal truncation level was data-dependent. II. Case study results: A Cox proportional hazards model with time-varying covariates, unweighted multinomial logistic regression models adjusting only for baseline covariates, unweighted multinomial logistic regression models adjusting for both baseline and time-varying covariates, and MSCSHMs based on logistic regression and boosted CART were used to analyze the effect of the dynamic thiopurine regimen on the cause-specific hazards of two outcomes in patients with inflammatory bowel disease. The results showed that, whether the exposure of interest was defined as thiopurine use during the past 3 months or as cumulative use, none of the above models demonstrated a statistically significant exposure effect.

Conclusion: For observational data with continuous treatments, CBPS and npCBPS perform better in terms of balancing covariates; statistical learning algorithms such as boosted CART and RF perform better in terms of reducing the bias of estimates; when IPW is used to estimate the parameter of interest, the distribution of the stabilized weights should be
first examined, and the weights should be truncated if there are many outliers, before the outcome model is estimated in the second stage. For time-dependent competing risk data, when the MSCSHM is used to estimate dynamic treatment effects on the hazards of multiple outcomes, we suggest giving priority to boosted CART and RF for constructing the inverse probability weights, in order to reduce potential weight model misspecification when the treatment allocation mechanism is unknown in real data analysis.
Keywords/Search Tags: causal effect, inverse probability weighting, marginal structural model, continuous treatment, time-dependent confounding, statistical learning
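
The GLM-based weighting and truncation steps described in the abstract can be illustrated with a minimal sketch. It is not the dissertation's code (the study used R 3.4.3). It assumes the common normal-density form of stabilized weights for a continuous treatment, SW_i = f(A_i) / f(A_i | X_i), with the denominator taken from a linear regression of treatment on covariates. The function names, the simulated data and its coefficients are all hypothetical and chosen only for illustration.

```python
import numpy as np


def normal_pdf(x, mean, sd):
    """Density of a normal distribution evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))


def stabilized_weights(treatment, covariates):
    """Stabilized weights f(A) / f(A | X), assuming normal densities (an assumption)."""
    n = len(treatment)
    design = np.column_stack([np.ones(n), covariates])
    # Linear model for the treatment given covariates (the "GLM" step).
    coef, *_ = np.linalg.lstsq(design, treatment, rcond=None)
    fitted = design @ coef
    resid_sd = np.sqrt(np.sum((treatment - fitted) ** 2) / (n - design.shape[1]))
    numerator = normal_pdf(treatment, treatment.mean(), treatment.std(ddof=1))
    denominator = normal_pdf(treatment, fitted, resid_sd)
    return numerator / denominator


def truncate_weights(weights, lower_pct=1.0, upper_pct=99.0):
    """Cap weights at the given two-sided percentiles, e.g. the GLM(1,99) variant."""
    lo, hi = np.percentile(weights, [lower_pct, upper_pct])
    return np.clip(weights, lo, hi)


def weighted_slope(outcome, treatment, weights):
    """Slope of a weighted least-squares regression of outcome on treatment."""
    design = np.column_stack([np.ones(len(treatment)), treatment])
    weighted_design = design * weights[:, None]
    beta = np.linalg.solve(design.T @ weighted_design, weighted_design.T @ outcome)
    return beta[1]


# Hypothetical simulated cohort: two confounders, true treatment effect of 1.0.
rng = np.random.default_rng(0)
n = 2500
x = rng.normal(size=(n, 2))
a = 0.3 * x[:, 0] - 0.2 * x[:, 1] + rng.normal(size=n)
y = 1.0 * a + 2.0 * x[:, 0] + rng.normal(size=n)

sw = stabilized_weights(a, x)
sw_trunc = truncate_weights(sw, 1.0, 99.0)

naive_slope = weighted_slope(y, a, np.ones(n))  # ignores confounding
ipw_slope = weighted_slope(y, a, sw)            # untruncated stabilized weights
trunc_slope = weighted_slope(y, a, sw_trunc)    # weights truncated at (1, 99)

print("naive:", naive_slope, "IPW:", ipw_slope, "IPW(1,99):", trunc_slope)
print("weight range:", sw.min(), sw.max(), "truncated:", sw_trunc.min(), sw_trunc.max())
```

In line with the analysis strategy in the abstract, the distribution of `sw` would be inspected before the outcome model is fitted. Truncation narrows the weight range and so trades some bias for lower variance, which is why the abstract describes the truncation level as data-dependent.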