| Background: Bayesian Additive Regression Trees (BART), a class of tree-based regression models, can flexibly fit interactions and nonlinearities. As a probability-model-based approach, BART shows much promise relative to its algorithmic counterparts, and its ensemble of trees strengthens generalization. However, it still has limitations in missing data processing and tree structure sampling, and existing BART-based survival analysis models are so complex that their application is limited. It is therefore necessary to extend the BART methodology for missing data processing and tree structure sampling, and to build a simple BART-based survival analysis model. Objectives: (1) To enable BART to handle missing covariate data under the Missing Completely At Random (MCAR), Missing At Random (MAR), and Not Missing At Random (NMAR) mechanisms, and to extend and optimize tree structure sampling so as to improve the predictive performance of the model. (2) To build a simple BART-based survival analysis model for right-censored data to improve its applicability in survival analysis. Methods: (1) For missing data, the "Missing Incorporated in Attributes" (MIA) technique was introduced into BART to support prediction under the three missing data mechanisms for covariates. (2) Tree structure sampling in BART was extended and optimized by removing the swap proposal, resetting the probabilities of the GROW, PRUNE, and CHANGE proposals, restricting the internal nodes eligible for the CHANGE proposal to those with two terminal child nodes, and optimizing the acceptance rate of tree sampling. (3) The resulting extension of BART (MTBART) was evaluated through simulations and real cases, and applied to data with continuous and binary response variables. (4) A simple survival analysis model was built by introducing the order statistics of the right-censored data, evaluated by simulations, and applied to a
real case. Results: (1) The stochastic search for splitting rules allowed observations with missing values to be grouped with observations having similar response values. Because the Metropolis-Hastings step in BART was based on the MIA rule, the algorithm tended to move towards splitting rules and corresponding groupings that increased the overall model likelihood P(Y | X, M). (2) In the tree structure simulations using Friedman’s five-dimensional function with N = 200, 500, 1000, and 2000, the acceptance rate of tree sampling was stable at about 40%. (3) Convergence diagnostics showed that MTBART converged quickly. In the predictive accuracy simulations, the Root Mean Square Error (RMSE) quantiles (50%, 75%) of MTBART were (0.90, 0.95), (0.93, 0.98), and (0.99, 1.06), respectively, significantly lower than those of BART: (1.25, 1.31), (1.46, 1.52), and (1.62, 1.68), respectively. On six real data sets, the RMSEs of MTBART were 6.072, 3.003, 4.105, 0.627, 0.715, and 3.091, respectively, significantly lower than those of BART: 6.745, 4.506, 4.129, 0.653, 0.764, and 3.267, respectively. (4) In the HbAl study with a continuous response variable, using 100 test samples, the coverage rate was 90% for the 95% confidence interval and 99% for the 95% prediction interval. In the breast cancer study with a binary response variable, the benign/malignant classification results for 680 cases were: accuracy 0.975; error rate 0.025; recall 0.973; precision (positive predictive value) 0.989; specificity 0.979; negative predictive value 0.951. The estimated probabilities for three malignant breast cancer test cases were 0.921, 0.918, and 0.932, respectively, indicating that the predictive accuracy of MTBART was high. (5) In the survival analysis model based on Bayesian Additive Regression Trees (SURBART), δ_i is an indicator distinguishing events (δ_i = 1) from right-censoring (δ_i = 0). Model evaluation by simulations: ① One-sample scenario with N = 200 and a censoring rate of 50%: the
coverage probability was 0.96 for SURBART and 0.95 for Kaplan-Meier (KM); the bias was 0.002 for SURBART and 0.005 for KM; the RMSE was 0.034 for SURBART and 0.035 for KM. ② Two-sample scenario with N = 400 and a censoring rate of 50%: the coverage probability was 0.97 for SURBART and 0.96 for KM; the bias was -0.007 for SURBART and -0.004 for KM; the RMSE was 0.043 for SURBART and 0.049 for KM. These results showed that the posterior intervals from the SURBART model had better coverage probability and a slightly lower RMSE, while the bias was similar. When two populations are compared, the SURBART model can estimate the difference in parameters in a single fit, whereas KM requires two separate estimates. ③ Regression scenarios with and without proportional hazards (PH and nPH): in the PH scenario, as expected, Cox regression performed very well in terms of both bias and RMSE, and SURBART came reasonably close to it; in the nPH scenario, the bias and RMSE of SURBART were significantly lower than those of Cox regression. ④ Regression scenario with a highly nonlinear relationship with covariates: data were generated with Friedman’s five-dimensional test function, and SURBART fitted the complex functional relationship between covariates and survival probabilities well. (6) The real case was a retrospective cohort of 845 patients followed for survival after a reduced-intensity hematopoietic cell transplant from an unrelated donor between 2000 and 2007, with 13 covariates considered. The SURBART results showed that, among the three conditioning regimens, fludarabine with cyclophosphamide gave the best patient survival, followed by fludarabine with busulfan, and fludarabine with melphalan gave the worst. There was no interaction between type of transplantation and age. Over the 3-year survival period, the therapeutic effect of methotrexate was significantly better than that of mycophenolate mofetil. Conclusions: (1) The predictive performance of Bayesian Additive Regression Trees can be improved by handling missing data in
covariates and optimizing tree structure sampling; the resulting model can also be used to evaluate variable importance, partial dependence of variables, and interactions easily and effectively, and it has good usability. (2) We built a simple survival analysis model based on Bayesian additive regression trees that can fit complex relationships between covariates and survival probabilities, including nonlinear relationships and high-dimensional parameter spaces, without any distributional assumptions or a proportional hazards assumption. The model can also be used to select important covariates, obtain the partial dependence of variables, and examine interactions. The model is robust and reliable, and it extends the application of Bayesian additive regression trees to survival analysis. |
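
The MIA technique named in the Methods can be illustrated with a small routing function. This is only a sketch of the general MIA idea, in which each candidate split may send missing values left, send them right, or split on missingness itself. The function name `mia_goes_left` and its parameters are hypothetical and are not taken from the MTBART implementation, which the abstract does not describe at this level of detail.

```python
import math


def mia_goes_left(value, threshold, missing_left=True, split_on_missingness=False):
    """Decide whether one covariate value is routed to the left child.

    Hypothetical helper illustrating the three MIA split variants:
      1. x <= threshold OR x missing  -> left  (missing_left=True)
      2. x <= threshold               -> left, missing -> right (missing_left=False)
      3. x missing -> left, x observed -> right (split_on_missingness=True)
    """
    missing = value is None or math.isnan(value)
    if split_on_missingness:
        # Variant 3: the split is on the missingness indicator itself.
        return missing
    if missing:
        # Variants 1 and 2: missing values follow the direction stored in the rule.
        return missing_left
    # Observed values use the ordinary threshold comparison.
    return value <= threshold


# Route a small covariate column under each of the three variants.
column = [0.2, float("nan"), 0.9, None, 0.4]
variant_1 = [mia_goes_left(v, 0.5, missing_left=True) for v in column]
variant_2 = [mia_goes_left(v, 0.5, missing_left=False) for v in column]
variant_3 = [mia_goes_left(v, 0.5, split_on_missingness=True) for v in column]
```

In a sampler, the direction for missing values would be proposed together with the splitting variable and threshold, so the Metropolis-Hastings step can favor the variant that groups incomplete observations with similar responses.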
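
The simulations above rely on Friedman's five-dimensional test function and report RMSE. The sketch below shows one common way to generate such data and score predictions. The uniform covariates on [0, 1), the unit noise standard deviation, and the helper names `friedman_mean`, `simulate_friedman`, and `rmse` are assumptions made for illustration, since the abstract does not state the exact simulation settings.

```python
import math
import random


def friedman_mean(x):
    """Noise-free Friedman test function of the first five covariates."""
    x1, x2, x3, x4, x5 = x[:5]
    return (10.0 * math.sin(math.pi * x1 * x2)
            + 20.0 * (x3 - 0.5) ** 2
            + 10.0 * x4
            + 5.0 * x5)


def simulate_friedman(n, seed=0, noise_sd=1.0):
    """Draw n observations; covariate range and noise level are assumed."""
    rng = random.Random(seed)
    X = [[rng.random() for _ in range(5)] for _ in range(n)]
    y = [friedman_mean(row) + rng.gauss(0.0, noise_sd) for row in X]
    return X, y


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    squared = [(a - b) ** 2 for a, b in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Example: score the noise-free function against noisy responses for N = 200.
X, y = simulate_friedman(200)
oracle_rmse = rmse(y, [friedman_mean(row) for row in X])
```

Here `oracle_rmse` estimates the noise level, which is the floor that any fitted model's test RMSE can be compared against.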
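
For reference, the Kaplan-Meier comparator used in the one-sample and two-sample scenarios can be written directly in terms of the event indicator δ_i. This is a minimal product-limit sketch in plain Python, not the code used in the study, and the function name `kaplan_meier` is hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of S(t) at each distinct event time.

    times  : observed times (event or censoring)
    events : indicators, 1 for an event and 0 for right-censoring
    Returns a list of (time, survival_probability) pairs.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_tied = 0
        # Collect all observations sharing the same time.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_tied += 1
            i += 1
        if n_events > 0:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1.0 - n_events / n_at_risk
            curve.append((t, survival))
        # Events and censored observations both leave the risk set.
        n_at_risk -= n_tied
    return curve


# Four subjects, the second one right-censored at time 2.
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

Censored observations never create a step in the curve, but they shrink the risk set for later event times, which is how right-censoring enters the estimate.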