| Background and objective: Radiomics has developed rapidly in cancer research in recent years. As a non-invasive diagnostic biomarker, it can support clinical decision-making and shows great promise. With the widespread use of high-resolution CT and the growing adoption of low-dose CT (LDCT) screening for lung cancer, the detection rate of pulmonary ground-glass nodules (GGNs) has increased significantly. Surgery is considered the first-choice therapy because it provides diagnosis and treatment at the same time. Most patients have a good prognosis after surgery, but patients with micropapillary subtypes or certain driver gene mutations have highly malignant disease and a poor prognosis, and require a wider resection. However, the distinction between benign and malignant lesions, the timing of surgery, and the extent of resection currently rely only on manual interpretation of CT. In this study, patients with pulmonary nodules who underwent surgery were selected as the study population. Based on high-dimensional CT radiomics data, prediction models were built around three key questions: accurate diagnosis of benign versus malignant pulmonary nodules, high-risk micropapillary subtypes, and mutation characteristics of tumor driver genes. Predictive ability was evaluated for a clinical information model, a radiomics model and an integrated model, providing a new strategy for the accurate diagnosis of early lung cancer. Subjects and methods: Study population: From November 2016 to July 2018, patients with pulmonary nodules smaller than 2 cm who underwent thoracic surgery at the Jiangsu Cancer Hospital Affiliated to Nanjing Medical University were enrolled. The inclusion for each part was as follows: (1) The first part of the study included 210 patients with pathologically diagnosed pulmonary adenocarcinoma or benign lesions (165 cases of pulmonary adenocarcinoma, 45 cases of benign lesions). (2) The study of the predictive model for high-risk subtypes of
micropapillae included 67 patients with lung adenocarcinoma pathologically confirmed after surgery (44 and 23 cases, grouped by the presence or absence of a micropapillary component). (3) The third part, the study of the predictive model of driver mutations, included 51 patients with lung adenocarcinoma pathologically confirmed after surgery (a total of 61 small pulmonary nodules were diagnosed as lung adenocarcinoma). Each tissue sample underwent next-generation sequencing to obtain tumor mutation burden (TMB) and driver mutation information such as EGFR and TP53. Image acquisition and processing: All cases were scanned on a GE Discovery CT750 HD gemstone spectral CT scanner. Using 3D Slicer 4.7, experienced clinicians drew the contours of the pulmonary lesions: each lesion was first outlined automatically, and the outline was then corrected manually. Subsequently, 718 variables were extracted from the lesion volume, including shape, texture, first-order statistics and wavelet features. Data analysis and modeling: The analysis comprised four steps: data preprocessing, dimensionality reduction, modeling and model evaluation. In preprocessing, hot-deck imputation was used to fill in missing values. For dimensionality reduction and modeling, LASSO and group LASSO (GL), both penalized regression methods, were used to screen for important associated variables over 1000 repetitions of five-fold cross-validation, to ensure stable cross-validation results. Variables selected in more than 60% of the runs were then entered into a logistic regression model to build the final model and predict the outcome. Other dimensionality reduction and modeling methods were used for comparison, including univariate screening (Uni), principal component analysis (PCA), random forest (RF) and support vector machine (SVM). For model evaluation, the area under the curve (AUC) was the main metric, supplemented by sensitivity (Sens), specificity (Spec), positive predictive value (PPV), negative
predictive value (NPV) and accuracy (Acc) to evaluate the predictive ability of the model. In addition, DeLong's test was used to compare the AUC of each single-source model (clinical information or radiomics) with the AUC of the integrated model. Results: Part Ⅰ: Prediction model for the diagnosis of benign and malignant pulmonary nodules. Using GL and logistic regression, dimensionality reduction models were built for the clinical data set (clin), the radiomics data set (rad) and the combined data set (rad_clin). Data preprocessing was carried out first: the variables urinary microalbumin (UALB), urinary glucose diacid (UGA), glycosylated hemoglobin (GA%) and D-Dimer were excluded because more than 40% of their values were missing. After 1000 repetitions of five-fold cross-validation, the AUC and Acc of the combined (rad_clin) model built with GL and logistic regression were higher than those of the clinical information (clin) model and the radiomics (rad) model alone, and the differences were statistically significant. The final model achieved AUC=0.908, PPV=0.700, NPV=0.938, Sens=0.778, Spec=0.909, Acc=0.881. DeLong's test showed that the AUC of the combined model differed significantly from the AUC of the clinical information model and of the radiomics model (PDelongclin=0.020, PDelongrad=2.5e-5). The integrated benign/malignant prediction model also balanced PPV against NPV and Sens against Spec well. Variables selected in more than 90% of the GL runs were then extracted. The radiomics wavelet features wavelet.HLHglcm.23 and wavelet.HLLglcm.23, together with blood uric acid (UA), diastolic blood pressure (DBP) and smoking status (SMK), were found to be potentially associated with the benign or malignant outcome. Part Ⅱ: Classification and prediction of high-risk
subtypes of micropapillae in early lung adenocarcinoma. As in the first part, data preprocessing excluded the variables D-Dimer, UALB, UGA, GA, carbohydrate antigen 125 (CA125), high-density lipoprotein cholesterol (HDL-C) and low-density lipoprotein cholesterol (LDL-C) because more than 40% of their values were missing. After 1000 repetitions of five-fold cross-validation, the AUC, PPV, NPV, Sens, Spec and Acc of the combined (rad_clin) model were higher than those of the clinical information model or the radiomics model alone, and the differences were statistically significant. The combined prediction model achieved AUC=0.965, PPV=0.808, NPV=0.951, Sens=0.913, Spec=0.886, Acc=0.896. DeLong's test showed that the AUC of the combined model differed significantly from that of the clinical information model and of the radiomics model (PDelongclin=0.019, PDelongrad=0.033). The integrated model also balanced PPV against NPV and Sens against Spec well. Screening for high-frequency variables with the method used in the first part revealed a potential association of wavelet.HHLglrlm.4 and serum magnesium (Mg) with high-risk subtypes of micropapillae. Part Ⅲ: Prediction model of driver mutations in early lung cancer. In data preprocessing, the variables CA125, CA153 and NSE were excluded because more than 40% of their values were missing. After 1000 repetitions of five-fold cross-validation, the EGFR and TP53 mutation prediction models and the TMB high/low prediction model all showed better AUC, PPV, NPV, Sens, Spec and Acc for the combined (rad_clin) model. The combined EGFR mutation prediction model achieved AUC=0.803, PPV=0.792, NPV=0.923, Sens=0.974, Spec=0.545, Acc=0.820; TP53 model
AUC=0.916, PPV=0.609, NPV=0.974, Sens=0.933, Spec=0.804, Acc=0.836; TMB model AUC=0.952, PPV=0.944, NPV=0.840, Spec=0.913, Acc=0.902. DeLong's test compared the AUC of the clinical information model and of the radiomics model with the AUC of the integrated model. For EGFR, neither difference was statistically significant (PDelongclin=0.520, PDelongrad=0.169). For TP53, the difference between the clinical information model and the integrated model was statistically significant (PDelongclin=0.011), whereas the difference between the radiomics model and the integrated model was not (PDelongrad=0.257). For TMB, the difference between the clinical information model and the integrated model was not significant (PDelongclin=0.542), whereas the difference between the radiomics model and the integrated model was statistically significant (PDelongrad=4.4e-5). The integrated model also balanced PPV against NPV and Sens against Spec well. Further high-frequency variable screening revealed that the blood indicators CA and CEA, as well as the image features wavelet.HHfirst order.6 and wavelet.LHL first order.1, may be associated with EGFR mutation outcomes. The blood indicators ALP and EOS, as well as the image features wavelet.HLL first order.8, wavelet.LLLglszm.12 and wavelet.LLLglszm.3, may be associated with TP53 mutation outcomes, while the image feature originalglcm.4, carbohydrate antigen CA199 and lymphocyte hematocrit (LYM) may be associated with TMB outcomes. Conclusions: By combining high-dimensional radiomics and clinical information data, GL dimensionality reduction and modeling can effectively establish predictive models of benign and malignant pulmonary nodules, high-risk micropapillary subtypes of early lung adenocarcinoma, and TMB status and EGFR mutation in early
lung adenocarcinoma, even with a small sample size, and these models perform better than the single-source prediction models. In addition, comparison with common methods such as Uni, RF and SVM shows that the integrated multi-omics model improves prediction, and that GL gives better predictive performance and more interpretable results. Expanding the sample size and adding an external validation set are expected to further improve the prediction accuracy and reliability of the model. The results of this study provide a method and basis for the accurate clinical diagnosis of early lung adenocarcinoma. |
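
The methods describe keeping only variables that were selected in more than 60% of the repeated cross-validation runs before fitting the final logistic regression. A minimal Python sketch of that frequency filter is given below. It is an illustration under stated assumptions and not the authors' code: the function names `selection_frequency` and `stable_variables` are hypothetical, the five runs are made-up toy data (the study used 1000 repetitions of five-fold cross-validation), and the LASSO or group LASSO fit that would produce each run's selected set is not shown.

```python
from collections import Counter


def selection_frequency(selected_per_run):
    """Return the fraction of runs in which each variable was selected."""
    n_runs = len(selected_per_run)
    # set(run) guards against a variable being listed twice within one run.
    counts = Counter(name for run in selected_per_run for name in set(run))
    return {name: count / n_runs for name, count in counts.items()}


def stable_variables(selected_per_run, threshold=0.60):
    """Keep variables selected in strictly more than `threshold` of the runs."""
    freq = selection_frequency(selected_per_run)
    return sorted(name for name, f in freq.items() if f > threshold)


# Hypothetical toy input: the variables selected in each of five runs.
runs = [
    ["UA", "DBP", "wavelet.HLHglcm.23"],
    ["UA", "wavelet.HLHglcm.23"],
    ["UA", "SMK", "wavelet.HLHglcm.23"],
    ["UA", "DBP"],
    ["UA", "DBP", "SMK", "wavelet.HLHglcm.23"],
]

stable = stable_variables(runs, threshold=0.60)
print(stable)
```

In this toy input, DBP is selected in exactly 60% of the runs and is therefore dropped, because the rule as stated in the abstract is "more than 60%". Raising `threshold` to 0.90 would correspond to the stricter screening used to report the high-frequency variables.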
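
The secondary evaluation metrics named in the methods (Sens, Spec, PPV, NPV, Acc) all derive from the four cells of a binary confusion matrix. The sketch below shows the standard definitions in Python. The function name `classification_metrics` and the counts passed to it are hypothetical and purely illustrative; the abstract does not report confusion-matrix counts or state which class was treated as positive.

```python
def classification_metrics(tp, fn, tn, fp):
    """Standard binary classification metrics from confusion-matrix counts."""
    return {
        "Sens": tp / (tp + fn),                   # true positive rate
        "Spec": tn / (tn + fp),                   # true negative rate
        "PPV": tp / (tp + fp),                    # positive predictive value
        "NPV": tn / (tn + fn),                    # negative predictive value
        "Acc": (tp + tn) / (tp + fn + tn + fp),   # overall accuracy
    }


# Hypothetical counts, chosen only to illustrate the formulas.
metrics = classification_metrics(tp=40, fn=10, tn=80, fp=20)
for name, value in metrics.items():
    print(f"{name}={value:.3f}")
```

Because PPV and NPV depend on class prevalence while Sens and Spec do not, a model can score well on one pair and poorly on the other when the classes are imbalanced, which is why the abstract reports both pairs.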