Colorectal cancer (CRC) is one of the most common cancers worldwide and a serious threat to human health. In clinical practice, researchers have found that stage Ⅱ CRC patients have heterogeneous prognoses. Although the National Comprehensive Cancer Network (NCCN) and the European Society for Medical Oncology (ESMO) have proposed high risk factors for stage Ⅱ CRC patients, some patients with those factors still achieve long-term survival, while other patients without them experience early recurrence or death. Researchers are therefore seeking additional prognostic factors for stage Ⅱ CRC patients. A growing number of studies have found that imaging examinations such as computed tomography (CT) and magnetic resonance imaging (MRI), which are widely used for disease diagnosis and prognosis prediction, contain a wealth of information. A previous study by our research group found that the tumor enhancement ratio (TER), defined as the ratio of the CT value of the tumor on the enhanced CT scan to that on the plain CT scan, was an important prognostic factor for non-metastatic colon cancer patients. Colon cancer patients with a TER above 1.75 had significantly worse overall survival than those with a TER below 1.75. Nevertheless, whether TER has prognostic value in stage Ⅱ colon cancer remains unexplored. In addition, no study has yet compared the importance of traditional high risk factors with that of TER in predicting the prognosis of stage Ⅱ colon cancer patients. Apart from radiological indicators, some studies have explored the prognostic value of the expression of specific genes in CRC tumor tissue, yet most had relatively small sample sizes, and the results were inconsistent across studies. Furthermore, there are few recurrence prediction models specifically for stage Ⅱ CRC patients.

The 21st century is the era of big data. With the continuous accumulation of medical data, machine learning models have been gradually applied to clinical
practice. The random survival forest (RSF) model is an ensemble machine learning model based on survival trees and is suitable for building prognostic models from survival follow-up data. Unlike the traditional Cox regression model, the RSF model does not need to assume the distribution of parameters in advance, and it does not require a linear relationship between variables and survival risk. It is also suitable for high-dimensional data. In addition, the RSF model can rank variables by importance, which helps screen important variables and reduce dimensionality.

The current study is divided into two parts, both of which explore prognostic factors for stage Ⅱ colorectal cancer patients and build predictive models based on the RSF model. The first part builds an RSF model with TER and traditional high risk factors to predict the overall survival of stage Ⅱ colon cancer patients, and the second part builds an RSF model with gene expression profiles to predict the recurrence-free survival of stage Ⅱ CRC patients.

1 The prognostic value of tumor enhancement ratio in stage Ⅱ colon cancer patients and the establishment of prediction model

Methods
We enrolled stage Ⅱ colon cancer patients at our center from 2007 to 2014 and randomly divided them into a train set (60%) and a test set (40%). The tumor enhancement ratio (TER) was calculated from abdominal enhanced computed tomography scans. Traditional high risk factors (bowel obstruction or perforation, pT4, poor tumor differentiation, vascular invasion, neural invasion and an inadequate number of examined lymph nodes), age, tumor location, preoperative CEA level and overall survival follow-up data were collected. Several RSF models were built in the train set with different combinations of variables, and prognostic risk scores were then exported for patients in the train set and the test set. The minimal depth (MD) of each variable, an indicator of variable importance, was also calculated and used for variable selection and model
simplification. A time-dependent receiver operating characteristic (tdROC) curve was used to assess the predictive ability of each model, and the threshold of the prognostic risk score was chosen to maximize the Youden index. Patients in the test set were classified into a high risk group and a non-high risk group according to this threshold, and Kaplan-Meier survival curves and the log-rank test were used to examine the survival difference between the two groups.

Results
A total of 284 stage Ⅱ colon cancer patients were enrolled. The train set and the test set had 170 and 114 patients, respectively. We first built an RSF model with traditional high risk factors only; the AUC of the tdROC curve at 5 years was 0.502 in the test set. Under this model, overall survival did not differ significantly between the predicted high risk group and the predicted non-high risk group in the test set (HR=2.42, 95% CI: 0.68-8.57, p=0.167). We then added TER to the model and found that the mortality risk increased nonlinearly as TER increased from 1.5 to 2.5. TER also had the smallest MD value, indicating that it ranked first in variable importance. In the test set, the AUC of the tdROC curve at 5 years improved to 0.760, and the predicted high risk group tended to have worse overall survival than the predicted non-high risk group (HR=2.60, 95% CI: 0.91-7.45, p=0.076). When TER, traditional high risk factors, age, tumor location and preoperative CEA were all entered into the RSF model, the AUC of the tdROC curve at 5 years was 0.735 in the test set, and the predicted high risk group had significantly worse overall survival than the predicted non-high risk group (HR=12.8, 95% CI: 3.09-53.1, p<0.001). Seven variables (TER, age, tumor location, preoperative CEA, vascular invasion, bowel obstruction or perforation, and pT4) had MD values below the threshold. A simplified model was built from these seven variables, and the AUC of the tdROC curve at 5 years was
0.717 in the test set. The predicted high risk group still had significantly worse overall survival than the predicted non-high risk group in the test set (HR=5.50, 95% CI: 1.68-18.1, p=0.005).

Conclusions
TER is an important prognostic factor for stage Ⅱ colon cancer patients. An RSF model that includes TER can significantly stratify stage Ⅱ colon cancer patients into high risk and non-high risk groups.

2 Identification of recurrence related genes and establishment of recurrence prediction model in stage Ⅱ colorectal cancer patients

Methods
Stage Ⅱ colorectal cancer (CRC) gene expression microarray datasets in the NCBI database were screened, and raw CEL files were retrieved. The robust multi-array average (RMA) algorithm was used to preprocess the microarray data, and hierarchical clustering was performed on the gene expression matrix to identify outlier samples. The MetaDE package was used to perform the microarray meta-analysis, and stage Ⅱ CRC recurrence related genes were selected at a false discovery rate (FDR) < 0.1. All samples were randomly divided into a train set (60%) and a test set (40%); an RSF model was then built in the train set, and recurrence risk scores were output for patients in the test set. Variable importance was evaluated by the minimal depth (MD) of each variable, which was also used for model simplification. A time-dependent receiver operating characteristic (tdROC) curve was used to assess the predictive ability of the model, and the threshold of the recurrence risk score was chosen to maximize the Youden index. Patients in the test set were classified into a high or low recurrence risk group according to this threshold, and Kaplan-Meier survival curves and the log-rank test were used to examine the difference in recurrence-free survival between the two groups.

Results
After searching and screening the NCBI database, 6 gene expression microarray datasets (GSE14333, GSE17538, GSE33113, GSE39582, GSE24551, GSE92921) including 651 stage Ⅱ CRC patients met the inclusion criteria. Microarray
meta-analysis identified a total of 479 stage Ⅱ CRC recurrence related genes. The train set and the test set had 390 and 261 samples, respectively, and we built an RSF model with the expression levels of the 479 genes in the train set. The AUC of the tdROC curve at 5 years was 0.984 in the test set. There were 179 genes whose MD values were below the threshold, and the simplified (once) RSF model was rebuilt with the expression levels of these 179 genes. The AUC of the tdROC curve of the simplified (once) RSF model at 5 years was 0.988. Once again, 26 genes had MD values below the threshold, and the simplified (twice) RSF model was rebuilt with the expression levels of these 26 genes. The AUC of the tdROC curve of the simplified (twice) RSF model at 5 years was 0.993 in the test set. Based on the simplified (twice) RSF model, the predicted high recurrence risk group in the test set had significantly worse recurrence-free survival than the predicted low recurrence risk group (HR=1.824, 95% CI: 1.079-3.084, p=0.025).

Conclusions
Microarray meta-analysis can effectively identify recurrence related genes in stage Ⅱ CRC, and an RSF model built with recurrence related genes can significantly stratify stage Ⅱ CRC patients into high and low recurrence risk groups.
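
Both parts compute a ratio or a risk score and then dichotomize patients at the threshold that maximizes the Youden index. The following Python sketch illustrates those two calculations only. It is a simplified illustration, not the study's implementation: the function names are hypothetical, the plain-scan CT value is assumed to be positive, and the outcome is treated as a binary status at a fixed time point, whereas the time-dependent ROC analysis described above additionally accounts for censored follow-up.

```python
from typing import Sequence, Tuple


def tumor_enhancement_ratio(enhanced_ct: float, plain_ct: float) -> float:
    """TER = tumor CT value on the enhanced scan / CT value on the plain scan.

    Assumption: the plain-scan CT value is positive (soft-tissue tumor).
    """
    if plain_ct <= 0:
        raise ValueError("plain-scan CT value must be positive")
    return enhanced_ct / plain_ct


def youden_threshold(scores: Sequence[float],
                     events: Sequence[int]) -> Tuple[float, float]:
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    A patient is predicted high risk when score >= threshold.
    `events` holds 1 if the event occurred by the chosen time point, else 0.
    Simplification: censoring is ignored here.
    """
    if len(scores) != len(events):
        raise ValueError("scores and events must have the same length")
    n_pos = sum(events)
    n_neg = len(events) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one event and one non-event")

    best_t, best_j = scores[0], -1.0
    # Every observed score is a candidate cut-off.
    for t in sorted(set(scores)):
        tp = sum(1 for s, e in zip(scores, events) if e == 1 and s >= t)
        tn = sum(1 for s, e in zip(scores, events) if e == 0 and s < t)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


if __name__ == "__main__":
    # Made-up numbers for illustration only; not data from the study.
    print(tumor_enhancement_ratio(70.0, 40.0))
    print(youden_threshold([0.1, 0.2, 0.3, 0.8, 0.9], [0, 0, 0, 1, 1]))
```

In this toy example the scores separate events from non-events perfectly, so the selected threshold is the lowest score among patients with an event. With real risk scores the maximum of J is usually below 1, and the cut-off trades sensitivity against specificity.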