| Cancer is one of the most serious diseases threatening human health, and its morbidity and mortality are increasing year by year. Prognosis is an important indicator in disease treatment, and identifying prognostic markers for cancer can support precisely targeted therapy. This thesis takes lung adenocarcinoma (LUAD) and breast invasive carcinoma (BRCA) as examples to investigate tumor prognosis prediction models. It presents an improved tumor prognosis prediction model based on Random Survival Forest (RSF), which uses feature selection and a forward selection algorithm to identify prognostic markers of LUAD and BRCA, and which improves the prognostic accuracy for LUAD and BRCA compared with traditional methods. This may provide effective support for precision oncology. The main work of this thesis is as follows: (1) Data collection and preprocessing. The molecular sequencing data and clinical data used in this thesis were downloaded from the TCGA and GEO databases and preprocessed: clinical samples with missing values were removed, the molecular data were standardized, and the clinical and molecular samples were matched to obtain data meeting the experimental requirements. (2) Identification of key genes for LUAD prognosis. In the LUAD training set, the RSF algorithm was first used to select features from the molecular data and identify seed genes associated with LUAD survival. Univariate and multivariate Cox survival analyses were then performed on the clinical data to identify the statistically significant variables. Finally, the key prognostic genes of LUAD were identified by integrating the clinical data and the seed gene data in a forward selection model. (3) Validation of key genes for LUAD prognosis. The internal and external validation sets of LUAD were completely independent. A survival risk score system was constructed from the key prognostic genes in the two validation sets, and evaluation metrics such as the hazard ratio (HR), p-value, and C-index were calculated. The experimental results show that, compared with the traditional Cox model and with the use of individual seed genes, the proposed method improves the prediction accuracy of LUAD prognosis (internal validation set: C-index = 0.656; external validation set: C-index = 0.672), and the model also outperforms five existing prediction models. (4) Research on risk prediction of BRCA. In the BRCA training set, the SMOTE (Synthetic Minority Over-sampling Technique) algorithm was first used to address data imbalance. The RSF model was then used to select features from the standardized molecular data and identify seed genes associated with BRCA survival. Next, univariate and multivariate Cox survival analyses were performed on the processed clinical data to identify the statistically significant variables. The key prognostic genes of BRCA were then identified by integrating the clinical data and the seed gene data in a forward selection model. Finally, the original BRCA validation set was used to evaluate the model. The experimental results show that, compared with using the original BRCA training set data, the proposed method improves the prediction accuracy of BRCA prognosis (the C-index increased from 0.667 to 0.702). |
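
The C-index used as the main evaluation metric above can be illustrated with a minimal sketch of Harrell's concordance index for right-censored survival data. This is not the thesis's implementation: the function name `concordance_index`, its argument names, and the toy data are assumptions made for illustration, and pairs with tied survival times are simply skipped.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored data (illustrative sketch).

    times:       observed follow-up time for each patient
    events:      1 if the event (death) was observed, 0 if censored
    risk_scores: model output, where a higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # tied times are skipped in this simplified sketch
        # Order the pair so that `a` is the patient with the shorter time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # shorter time is censored, so the pair is not comparable
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # higher risk assigned to the earlier event
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied scores count as half concordant
    if comparable == 0:
        raise ValueError("no comparable pairs in the data")
    return concordant / comparable


# Hypothetical toy data, not taken from TCGA or GEO.
toy_times = [5, 10, 12, 3, 8]
toy_events = [1, 0, 1, 1, 0]
toy_scores = [0.8, 0.3, 0.2, 0.9, 0.5]
print(concordance_index(toy_times, toy_events, toy_scores))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported values (for example 0.656 and 0.672 for LUAD) should be read.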