
Prognostic Model Of Breast Cancer Patients Based On Feature Selection

Posted on: 2024-05-08
Degree: Master
Type: Thesis
Country: China
Candidate: J L Fu
GTID: 2544307091491784
Subject: Applied Statistics
Abstract/Summary:
According to statistics from the World Health Organisation (WHO), breast cancer has become the most common malignant tumour in women, with more than one million women dying of breast cancer each year worldwide, and the number is still growing. Although the incidence of breast cancer in China is relatively low compared with developed regions such as Europe and the United States, the mortality rate is still on the rise and is growing at more than twice the rate. Not only is breast cancer a major health threat, but the cost of treatment is often a heavy burden for families. In response to the many problems associated with the diagnosis and treatment of breast cancer, research increasingly uses statistical methods, such as machine learning models that assist physicians in their decision making, in order to improve patient prognosis.

In this thesis, clinical diagnosis data for breast cancer patients in the SEER database from 2010-2014 were processed to obtain 61,145 raw records. Each patient's survival status over a five-year period (survival versus death), determined from information such as survival time and survival status, is the dependent variable. The thesis makes improvements to the feature selection methods and to their combination with classification algorithms. Three different types of feature selection methods were used to filter the variables: Lasso regression (an embedded method), the Copula entropy method (a filter method) and the RF_REF method (a wrapper method). The 22 original features for each patient were reduced to 12 variables by Lasso regression; 11 variables were retained by t-test after calculating the Copula entropy of each variable; and the RF_REF method gave the best model performance with a feature subset of 15 variables.

The class imbalance in the original dataset was then addressed with the SMOTEENN hybrid sampling algorithm to improve the identification of minority-class samples, reducing the imbalance ratio of the three datasets from the original 10.29 to 1.14, 1.18 and 1.56 respectively. The three feature-selected datasets were then fed into three single models (logistic regression, XGBoost, random forest) and a Stacking ensemble learning model, which were trained and tuned to improve classification performance.

The results showed that:

1. Across the three types of feature selection methods, the dataset selected by the RF_REF wrapper method gave better prediction results than the other two, whether combined with a single model or with the Stacking ensemble learning model.
2. Across the machine learning models, the RF_REF dataset combined with the Stacking ensemble learning model had the best prediction results among the 12 models, with the highest F1 score, AUC and G-mean of 0.9204, 0.9336 and 0.9334 respectively, while the datasets selected by Lasso regression and by the Copula entropy method gave their better results when combined with the random forest model.
3. Variable importances obtained from the random forest model on the datasets produced by all three feature selection algorithms indicated that tumour size and lymph node involvement status were the two most important factors affecting the prognosis of breast cancer patients.
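The abstract reports an imbalance ratio, F1 score and G-mean without defining them. The sketch below shows how these quantities are commonly computed for a binary outcome, assuming the usual definitions: imbalance ratio as majority count over minority count, and G-mean as the geometric mean of sensitivity and specificity. The function names and all counts are hypothetical and chosen only for illustration; they are not taken from the thesis or from the SEER data. AUC is not covered here because it needs predicted scores rather than a confusion matrix.

```python
import math


def imbalance_ratio(n_majority, n_minority):
    """Majority-class count divided by minority-class count (assumed definition)."""
    if n_minority <= 0:
        raise ValueError("minority class must contain at least one sample")
    return n_majority / n_minority


def binary_metrics(tp, fp, tn, fn):
    """Precision, recall, specificity, F1 and G-mean from confusion-matrix counts.

    The positive class is taken to be the minority class.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    g_mean = math.sqrt(recall * specificity)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "g_mean": g_mean,
    }


if __name__ == "__main__":
    # Hypothetical class counts that happen to give a ratio of 10.29.
    print(round(imbalance_ratio(1029, 100), 2))

    # Hypothetical confusion matrix: 90 TP, 20 FP, 80 TN, 10 FN.
    for name, value in binary_metrics(tp=90, fp=20, tn=80, fn=10).items():
        print(f"{name}: {value:.4f}")
```

G-mean penalises a model that scores well on the majority class while missing the minority class, which is why it is often reported alongside F1 for imbalanced data.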
Keywords/Search Tags: Feature Selection, Machine Learning, Unbalanced Data, Breast Cancer