Objective: Esophageal cancer is one of the most common digestive tract tumors and one of the most fatal cancers. Because there are no obvious clinical symptoms in the early stage, the overall survival rate of patients with esophageal cancer is low. Accurate prognosis prediction is therefore one of the keys to improving the survival rate. In this paper, through mining of high-throughput open-source data on esophageal cancer, we aim to screen out significant genes as potential biomarkers related to the prognosis of esophageal cancer, in order to provide a theoretical basis for its prognosis, diagnosis, and treatment. Methods: Based on the RNA-seq data of esophageal cancer in the TCGA database and the accompanying basic data, differential gene expression analysis and weighted gene co-expression network analysis were carried out with the DESeq2 and WGCNA packages in R to extract differential genes and hub genes, and the critical genes were selected as candidate biomarkers by survival analysis and variable selection. Finally, five classification methods, logistic regression (LR), support vector machine (SVM), random forest (RF), decision tree (DT), and extreme gradient boosting (XGBoost), were used to construct prediction models of esophageal cancer prognosis, and the predictive value of the candidate markers and the performance of the classification models were evaluated. Results: 1. A total of 39 differentially expressed genes were selected, all of which were up-regulated; pathway enrichment analysis showed that the differential genes had biological significance in biological process and molecular function and were enriched in 8 significant pathways; five genes significantly associated with survival status were retained in the survival analysis: ATR (HR=1.320, P=0.044), PPP1R15A (HR=1.354, P<0.001), RNU6-780P (HR=1.648, P=0.036), BRDTP1 (HR=1.014, P<0.001), and LRRTM4 (HR=2.280, P<0.001); 2. WGCNA, through the construction of a gene co-expression network, identified the gene modules salmon and
purple, which may be the most significantly related to esophageal cancer, and found the 3844 hub genes with the highest degree of connection with these modules. After lasso and elastic net variable selection, 15 and 22 hub genes were retained, respectively, and the 11 genes selected by both methods were extracted as the final hub genes; 3. Based on the extracted differential genes and hub genes, classification prediction models were constructed using machine learning algorithms: (1) The accuracy of the five prediction models based on the five differential genes was above 75%, and the AUC was above 0.65 (except for the DT model). In particular, the prediction accuracies of the SVM and RF models were above 80% and 85%, and their AUCs were 0.84 and 0.90, respectively; (2) Five prediction models were established with the 15 hub genes selected by lasso; their accuracies were all above 83% and their AUCs were all greater than 0.61. The accuracies of the prediction models established with the 22 hub genes selected by elastic net were all above 76%, and the AUCs were all above 0.68 (except for the LR and DT models). The accuracies of the prediction models established with the 11 common hub genes were all above 80%, and the AUCs were all above 0.78 (except for the DT model). Across the models constructed from these three groups of genes, the prediction accuracy and AUC of the SVM and RF models were above 83% and 0.88, respectively; (3) Five prediction models were established by combining the 5 differential genes with the 11 common hub genes. The accuracies of the models were all above 83%, and the AUCs were all greater than 0.68 (except for the DT model). Similarly, the prediction accuracy and AUC of the SVM and RF models were above 83% and 0.90, respectively. Conclusions: 1. Based on bioinformatics technology and statistical methods, CUL4B, ATR, PPP1R15A, HNRNPDL, RUSC1-AS1, ATG10-AS1, THA3, TEX101, PLXDC1, DLEC1P1, LINC01357, GREM1, CCDC118, HINT3, YAP1, and ELOCP15 were selected as potential candidate biomarkers related to the prognosis of esophageal cancer. Literature review and gene function queries suggest
that the above genes may have value as biomarkers for the prognosis of esophageal cancer, which merits further verification in clinical trials. 2. In the five prediction models, the selected candidate biomarkers showed good predictive value, with accuracies all above 83%. Among them, the AUCs of the SVM and RF models fluctuated around 0.90, indicating that these two models had better predictive ability. The good predictive ability of the SVM and RF models may therefore provide a modeling basis for subsequent studies of disease prognosis prediction.
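
The abstract reports two evaluation metrics (accuracy and AUC) and takes the genes common to the lasso and elastic net selections as the final hub genes. The following is a minimal Python sketch of how these quantities can be computed, not the authors' pipeline, which was run in R. The function names, the placeholder gene labels (`GENE_A` and so on), the labels, the scores, and the 0.5 decision threshold are all hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, Sequence


def common_genes(*selections: Iterable[str]) -> List[str]:
    """Genes retained by every variable-selection method, sorted by name."""
    return sorted(set.intersection(*(set(s) for s in selections)))


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of samples whose predicted class equals the true class."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve in its pairwise (Mann-Whitney) form.

    AUC equals the probability that a randomly chosen positive sample gets
    a higher score than a randomly chosen negative one; ties count as 0.5.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical gene lists standing in for the lasso and elastic net results.
lasso_selected = ["GENE_A", "GENE_B", "GENE_C"]
enet_selected = ["GENE_B", "GENE_C", "GENE_D"]
final_hub_genes = common_genes(lasso_selected, enet_selected)

# Hypothetical survival-status labels (1 = event) and classifier scores.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.35, 0.6, 0.2, 0.7, 0.1]
y_pred = [int(s >= 0.5) for s in scores]  # assumed 0.5 decision threshold

model_accuracy = accuracy(y_true, y_pred)
model_auc = auc(y_true, scores)
```

Accuracy depends on the chosen threshold, whereas AUC depends only on how the scores rank the samples. This is one reason a model can show high accuracy alongside a modest AUC, as with some of the DT results above.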