Liver cancer is one of the most common malignant tumors in the world and the most common malignant tumor of the digestive system in China; its mortality rate ranks third, and its incidence rate is increasing year by year. Accurate survival prediction can not only relieve the anxiety of liver cancer patients but also help medical workers make rational treatment decisions. This thesis combines genetic and clinical features to study the prediction of survival time in patients with liver cancer. The main contents are as follows.

Feature selection is performed on 422 initial data samples with high-dimensional, redundant features. Because gene expression features and clinical features have different biological meanings, this thesis constructs a dedicated feature selection procedure for each of the two types. For the lower-dimensional clinical features, traditional analysis of variance (ANOVA) is used to screen out the clinical features related to liver cancer. For the gene expression features, whose potential internal group structure is taken into account, the sparse group Lasso feature selection method is adopted to screen for related pathogenic genes. After data preprocessing and feature selection, the initial 110 clinical features and 20530 gene expression features are reduced to 15 and 455, respectively.

After feature selection, this thesis predicts the label of each liver cancer patient, that is, whether the patient is a long-term or a short-term survival sample. This step is implemented with classification algorithms; several algorithms, including K-Nearest Neighbor (KNN), Naive Bayes, Decision Tree, and XGBoost, are used in the prediction study. To obtain the most suitable parameters, two methods, Bayesian optimization and grid search, are used to tune the model parameters. The final experimental results show that the XGBoost algorithm tuned by Bayesian optimization performs best, with a predicted F1 score of
0.85.

Compared with existing studies on liver cancer survival prediction, this thesis combines clinical features and gene expression features as the data features; during feature selection, the respective characteristics of the two groups of features are fully considered, and an appropriate feature selection method is given for each. Considering the potential group structure among gene features, this thesis applies sparse group Lasso for feature selection and confirms its effectiveness. For survival prediction, the XGBoost algorithm is used, with its parameters tuned by Bayesian optimization (BO). The final numerical comparison shows that, compared with other classification algorithms and with conventional grid search parameter tuning, the BO-XGBoost algorithm achieves the best performance on the classification problem studied in this thesis.
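The ANOVA screening of clinical features described above can be illustrated with a minimal sketch. It assumes each clinical feature is numeric and that patients are split into two classes (long-term and short-term survival); the function names, feature names, and toy values are hypothetical and are not taken from the thesis, and the selection threshold actually used there is not stated.

```python
from statistics import mean


def anova_f_statistic(groups):
    """One-way ANOVA F statistic for a single feature.

    `groups` is a list of lists holding the feature's values in each class
    (for example, long-term vs. short-term survivors).
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two groups and more samples than groups")
    grand_mean = sum(sum(g) for g in groups) / n
    # Variation of the class means around the overall mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the samples around their own class mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    if ms_within == 0:
        return float("inf")
    return ms_between / ms_within


def rank_features(table, labels):
    """Score every feature and return (names sorted by decreasing F, scores)."""
    classes = sorted(set(labels))
    scores = {}
    for name, values in table.items():
        groups = [[v for v, y in zip(values, labels) if y == c] for c in classes]
        scores[name] = anova_f_statistic(groups)
    return sorted(scores, key=scores.get, reverse=True), scores


# Hypothetical toy data: 1 = long-term survival, 0 = short-term survival.
labels = [0, 0, 0, 1, 1, 1]
table = {
    "feature_a": [1, 2, 3, 4, 5, 6],  # class means differ
    "feature_b": [1, 5, 3, 2, 4, 3],  # class means are identical
}
order, scores = rank_features(table, labels)
```

A feature with a large F statistic separates the two survival classes more clearly than one with a small F statistic, so keeping the top-ranked features is one plausible way to reduce the clinical feature set.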
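For reference, the sparse group Lasso is commonly written as a penalized loss of the following form. This is the standard textbook formulation, given here as an assumption; the abstract does not state the exact loss or weighting used in the thesis. The symbols are illustrative: $\mathcal{L}$ is the loss (for example, a logistic loss for the two survival classes), $\beta^{(g)}$ is the coefficient sub-vector of gene group $g$, $p_g$ is the size of that group, $\lambda \ge 0$ controls the overall penalty strength, and $\alpha \in [0, 1]$ balances the two penalties.

```latex
\min_{\beta}\; \mathcal{L}(\beta)
  \;+\; (1-\alpha)\,\lambda \sum_{g=1}^{G} \sqrt{p_g}\,\bigl\lVert \beta^{(g)} \bigr\rVert_2
  \;+\; \alpha\,\lambda\,\lVert \beta \rVert_1
```

The group term can set whole groups of genes to zero, while the $\ell_1$ term removes individual genes inside the groups that remain, which is why the method suits features with a potential group structure.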
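The F1 score used to compare the classifiers is the harmonic mean of precision and recall. The sketch below computes it for a binary label; treating one class as the positive class is an assumption, since the abstract does not say which class is positive or whether the score is averaged over both classes. The labels in the example are made up for illustration.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 score: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = long-term survival, 0 = short-term survival.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]
score = f1_score(y_true, y_pred)  # 3 TP, 1 FP, 1 FN
```

Because it combines precision and recall, the F1 score stays informative when the two survival classes are unevenly represented, which plain accuracy does not guarantee.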