Malignant tumors are a major threat to the life and health of the Chinese population, and lung cancer is the foremost among them. Among all cancer types in China, lung cancer has consistently ranked first in both incidence and mortality, and it shows a year-by-year increase in all regions of the country. With the development of information science and molecular biology, studying the mechanisms of tumor occurrence and progression on the basis of gene expression levels has become increasingly mature. By mining the key prognostic genes of lung adenocarcinoma, a more accurate prognostic risk assessment can be achieved with a combination of genes. Combining early screening methods with gene-based risk assessment makes the prevention of lung adenocarcinoma more timely and treatment decisions more reasonable. The main research content of this paper comprises the following three parts.

The first part is the screening of key prognostic genes in lung adenocarcinoma. Gene expression data and related clinical data of lung adenocarcinoma patients were downloaded from the TCGA database and organized, covering 541 tumor tissue samples and 59 adjacent normal tissue samples. The conventional screening process for key prognostic genes was as follows: (1) Differential expression analysis was performed on the preprocessed gene expression data to obtain the list of differentially expressed genes in lung adenocarcinoma. (2) The differential expression data of the samples were combined with the patients' clinical data, and univariate Cox regression analysis was run in batches to screen out genes significantly associated with survival time and survival status. (3) Lasso regression analysis was applied to the genes retained by univariate Cox regression to eliminate highly collinear genes. (4) The best gene combination was selected by multivariate Cox stepwise regression, and a prognostic risk scoring model was constructed from this combination. (5) Kaplan-Meier (KM) survival analysis, risk triple plots, and ROC curves were used to evaluate the prognostic model in the training group and the test group respectively. The evaluation showed that the 14 genes screened by this procedure may contain redundant features and that the prognostic model performed poorly in the test group (AUC = 0.638), so the gene combination needed to be optimized.

The second part is the improvement of the gene screening method. Because the conventional method yielded a large number of genes, possibly owing to redundant genes, this part improved the method by adding a random survival forest model after the Lasso regression analysis. The 35 genes retained by Lasso regression were ranked by importance and filtered under different thresholds. The selected genes were then screened further by multivariate Cox stepwise regression to obtain the best gene combination, from which the prognostic risk scoring model was constructed. The models built under the different thresholds were evaluated with KM survival curves and ROC curves, and the model was found to perform better at a threshold of 0.3. At this threshold, 10 genes were selected as the key prognostic genes of lung adenocarcinoma: DEPDC7, FAM83A, AC122134.1, CNTNAP2, CDH19, KRT81, WFDC3, AC005100.2, SIRLNT, and MELTF. Intersecting these with the 14 genes obtained before the improvement yielded 4 genes in common.

The third part explores the biological function and clinical significance of these genes in lung adenocarcinoma. Taking the four intersection genes as representatives, their expression in lung adenocarcinoma and their correlation with clinical phenotypes were analyzed further. The expression of FAM83A and WFDC3 differed significantly between the tumor group and the normal group, with Wilcoxon test P values far below 0.01, whereas the differences for CNTNAP2 and AC005100.2 were weak, with relatively large P values. In the analysis of the correlation between gene expression and clinical phenotype, patients with higher and lower FAM83A expression showed no significant difference in age but differed significantly in the other clinical phenotypes studied. Patients with relatively high and relatively low WFDC3 expression differed significantly in survival status, age, and sex, and patients with relatively high and relatively low AC005100.2 expression differed significantly in sex.
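
As an illustration of how a prognostic risk scoring model of the kind described in the first and second parts is commonly applied, the sketch below computes a per-patient risk score as a weighted sum of gene expression values and splits patients into high- and low-risk groups. It is a minimal sketch under assumptions: the abstract states neither the scoring formula nor the cutoff, so the linear form (Cox coefficient times expression level) and the median split are assumed conventions, and the expression values and coefficients are made-up toy numbers, not results from this study.

```python
import numpy as np


def risk_scores(expression, coefficients):
    """Return one risk score per patient.

    Assumed form: score = sum_i (coefficient_i * expression_i), i.e. the
    linear predictor of a multivariate Cox model.
    expression: array of shape (n_patients, n_genes)
    coefficients: array of shape (n_genes,)
    """
    expression = np.asarray(expression, dtype=float)
    coefficients = np.asarray(coefficients, dtype=float)
    return expression @ coefficients


def split_by_cutoff(scores, cutoff=None):
    """Label patients 'high' or 'low' risk.

    If no cutoff is given, the median of the supplied scores is used
    (an assumed convention). A test group would normally reuse the
    cutoff derived from the training group.
    """
    scores = np.asarray(scores, dtype=float)
    if cutoff is None:
        cutoff = float(np.median(scores))
    groups = np.where(scores > cutoff, "high", "low")
    return groups, cutoff


# Toy example: 4 hypothetical patients, 2 hypothetical genes.
toy_expression = np.array([
    [1.0, 2.0],
    [3.0, 0.5],
    [0.0, 1.0],
    [2.0, 2.0],
])
toy_coefficients = np.array([0.5, -0.2])  # invented, for illustration only

toy_scores = risk_scores(toy_expression, toy_coefficients)
toy_groups, toy_cutoff = split_by_cutoff(toy_scores)
print(toy_scores, toy_groups, toy_cutoff)
```

The resulting group labels are what a KM survival comparison or ROC evaluation would then take as input. Deriving the cutoff from the training group and passing it explicitly when scoring the test group keeps the two evaluations comparable.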