Robust Penalized Logistic Regression And Robust Penalized Cox Regression Based On Trimming And Their Applications In Omics Data

Posted on: 2021-03-31; Degree: Doctor; Type: Dissertation
Country: China; Candidate: H W Sun
GTID: 1364330623475389; Subject: Epidemiology and Health Statistics
Abstract/Summary:
Objective: Previous studies have reported that labelling errors are not uncommon in omics data. Such potential outliers may severely undermine the correct classification of patients and the identification of reliable biomarkers for a particular disease. Identifying these mislabeled samples, and selecting the correct features from high-dimensional omics data that contain them, is therefore an urgent problem. We proposed a robust penalized Logistic regression based on trimming, and presented and proved its theoretical properties. We also compared it with other methods that address mislabeled high-dimensional omics data, to facilitate the selection of appropriate methods in practice. Penalized Cox regression such as Elastic Net (EN) is effective for screening prognostic factors and building prediction models with high-dimensional data. However, experimental or recording errors and sample heterogeneity cause outliers in the data, and these outliers may distort the EN estimates. If the outliers are not the result of experimental or recording errors, the survival times of these patients probably follow a different response pattern relative to their covariates. Identifying and analyzing these outliers may therefore help to find new prognostic factors and individualized treatments for these patients.

Methods: In the first part of this dissertation, LASSO-type and Elastic Net-type maximum trimmed likelihood estimators (MTL-LASSO and MTL-EN) were proposed. The robustness properties of MTL-LASSO were presented and verified. The AR-Cstep algorithm was proposed to find the solutions of MTL-LASSO and MTL-EN. Simulation studies were used to compare the performance of MTL-LASSO and LASSO. MTL-EN was compared with three other methods: Elastic Net (EN)-type penalized Logistic regression based on trimming (enetLTS), which uses the C-step algorithm; sparse label-noise-robust Logistic regression (Rlogreg); and Ensemble, an ensemble classification approach based on distinct feature selection and modelling strategies. The accuracy of biomarker selection and outlier detection of these methods was evaluated and compared in a simulation study. The four methods were then applied to a triple-negative breast cancer (TNBC) RNA-seq dataset. Outliers are difficult to detect when they lie close to one another, a phenomenon called masking. Robust methods can handle this problem well through trimming. In the second part of this dissertation, a penalized maximum trimmed likelihood estimator for EN-type penalized Cox regression (Elastic Net-type maximum trimmed partial likelihood estimator, MTPL-EN) was proposed. An improved concentration step algorithm, AR-Cstep, was adopted to find the solution of MTPL-EN. Simulation studies were conducted to compare the performance of MTPL-EN and the non-robust Elastic Net in terms of variable selection, outlier detection and prediction. Gene expression data of glioma patients were analyzed to illustrate its application.

Results: Part I
(1) The theoretical study showed that the estimate of LASSO-type penalized Logistic regression tends to zero and becomes very unstable when outliers are present. The proposed MTL-LASSO method resists a large proportion of outliers, i.e., it has a high breakdown point (BDP), which is closely related to the trimmed proportion. The simulation showed that MTL-LASSO could resist outliers in both the response and the predictors, and that the reweighting step ensured that the performance of Rwt MTL-LASSO remained stable.
(2) In the simulation study that compared MTL-EN, enetLTS, Rlogreg and Ensemble, Ensemble was best in variable selection when there were outliers only in the response, but its PSR was lower than that of MTL-EN. When there were outliers in both the response and the predictors, Ensemble performed worse than MTL-EN in variable selection. In terms of detecting mislabeled samples, MTL-EN performed best, with high sensitivity (Sn) and a false positive rate (FPR) controlled within 2%. The misclassification rate (MR) of MTL-EN was lower than that of the other methods. MTL-EN also took much less time than enetLTS and Ensemble. This illustrated that the proposed AR-Cstep algorithm used in MTL-EN converged faster than the C-step algorithm used in enetLTS. AR-Cstep can also find the optimal outlier-free subset more effectively, so it can screen variables and identify outliers more accurately.
(3) The four methods were applied to a triple-negative breast cancer (TNBC) RNA-seq dataset, which included individuals with discordant labels. MTL-EN and enetLTS identified seven suspicious individuals with inconsistent labels out of 47 and 43 detected outliers, respectively, which was better than the other two methods. The mislabeled samples identified by enetLTS were all non-TNBC patients, while the mislabeled samples identified by MTL-EN also included 13 TNBC patients, one of whom was a suspicious sample with inconsistent labels. In terms of genes screened, MTL-EN selected more genes, with small effect sizes, so these genes could serve as preliminary candidates for further study. Rlogreg, enetLTS and Ensemble selected fewer genes than MTL-EN. Ensemble and enetLTS detected more genes that have been reported to be related to TNBC than Rlogreg did.

Part II
Simulation studies showed that, in high-dimensional datasets with outliers, the robust MTPL-EN performed better than EN in variable selection, outlier detection and prediction, and that the reweighted Rwt MTPL-EN performed better than Raw MTPL-EN.
(1) When there were no outliers, the results of Rwt MTPL-EN were close to those of EN. When outliers existed, the robust Rwt MTPL-EN performed better than EN in variable selection, outlier detection and prediction. Compared with outliers who "fail too early", outliers who "live too long" made EN perform worse. However, outliers who "live too long" were easier for Rwt MTPL-EN to detect, and the accuracy of Rwt MTPL-EN remained stable in datasets with symmetric or asymmetric outliers.
(2) The performance of both EN and Rwt MTPL-EN decreased as the censoring rate increased. The results of Rwt MTPL-EN were better when the trimming rate of MTPL-EN was equal to or higher than the proportion of outliers. In every case, however, Rwt MTPL-EN performed better than EN.
(3) As the number of outliers in the response increased, EN selected fewer variables. When outliers also occurred in the predictors, EN selected far more variables than the true number of non-zero variables. In both cases, the accuracy of the variables selected by EN decreased. Rwt MTPL-EN, however, remained stable under the various conditions, indicating that it can resist outliers in both the response and the predictors.
(4) In the analysis of the glioma gene expression data, the genes selected by Rwt MTPL-EN differed from those selected by EN, and a higher proportion of them had been reported to be related to glioma. After the outliers were removed, the prediction accuracy was higher than that of EN, and more outliers who "lived too long" were identified.

Conclusion: We proposed two robust penalized Logistic regression methods based on trimming, the LASSO-type MTL-LASSO and the EN-type MTL-EN. The theoretical properties of LASSO-type penalized Logistic regression and MTL-LASSO are important supplements to the statistical properties of robust penalized Logistic regression. MTL-LASSO and enetLTS can both resist outliers in both the response and the predictors. If broad screening of associated genes or prediction of the response was required, then enetLTS was the best choice. In terms of detecting mislabeled samples, enetLTS performed best. If a low FDR was required to reduce failures in subsequent experimental validation, then Ensemble was the best choice. The Rwt MTPL-EN proposed in this dissertation selects variables more accurately than the non-robust EN when outliers exist. It can resist outliers in both the predictors and the response, as well as a large percentage of outliers. Rwt MTPL-EN can also identify outliers more accurately, especially outliers who "lived too long", which is of great significance since "living too long" outliers have a greater impact on EN. The AR-Cstep algorithm established in this dissertation means that the C-step algorithm no longer relies on separating individual contributions from the likelihood function of the model, an improvement that allows the C-step algorithm to be applied to more complex models.
Keywords/Search Tags: Breakdown point, Mislabeled, LASSO, Elastic Net, Robust