Research On Diagnosis And Prediction Model Of Breast Cancer Based On Ensemble Learning

Posted on: 2024-08-11
Degree: Master
Type: Thesis
Country: China
Candidate: J Y Guan
Full Text: PDF
GTID: 2544307106486254
Subject: Applied statistics
Abstract/Summary:
With the gradual development of the economy and the continuous improvement of material living standards, our country has built a moderately prosperous society in all respects, and people pay more and more attention to health. As a malignant tumor with a high incidence worldwide, breast cancer has a great impact on women’s health. Traditional breast cancer detection follows the "gold standard" method, which consists of three tests: clinical examination, radiographic imaging and pathological examination. Whereas this traditional approach relies on regression-based procedures to indicate the presence of cancer, newer machine learning techniques and algorithms are built on predictive models. With advances in science and technology and the continuing development of statistics, breast cancer diagnosis no longer stops at the surface of the data; it is more important to mine the information hidden in the data, find statistical regularities and provide valuable references for doctors’ diagnoses.

In this thesis, an ensemble learning model is built on the Wisconsin Breast Cancer (Diagnostic) data set from the UCI repository to predict the diagnosis of breast cancer patients as accurately as possible. First, exploratory data analysis was used to describe the distribution of the variables in the data set and the degree of their influence on the dependent variable. Outliers in the sample were then removed. The mean area, mean number of concave points, mean concavity and mean perimeter of the cell nuclei were found to have a strong influence on the dependent variable, while the mean fractal dimension, mean symmetry and mean smoothness of the nuclei had little effect.

For feature selection, three algorithms were applied to the 30 features in the data set: mRMR as a filter method, ReliefF (classed in the thesis as a wrapper method) and Lasso as an embedded method. The three methods selected the 8, 9 and 8 highest-weighted variables, respectively. The overlap among the variables selected by the three methods was low, so the methods were considered worth comparing.

With the variables determined, the random forest, AdaBoost and XGBoost ensemble models were each combined with the three variable selection methods, mRMR, ReliefF and Lasso. A predictive model of breast cancer diagnosis was fitted on the training data set and then used to predict on the test data set. The results show that, within each model family, random forest performed best when combined with ReliefF feature selection, with an accuracy of 0.9035; AdaBoost performed best when combined with Lasso, with an accuracy of 0.9298; and XGBoost performed best when combined with ReliefF, with an accuracy of 0.9473. After the comparison within each model family, the three models were compared with one another, each using its best feature set.

Conclusions: if medical personnel and researchers care most about the accuracy of diagnosis, the Relief-XGBoost model should be used; if the aim is to find as many patients with malignant tumors as possible, the Relief-Random Forest model is the better choice. Finally, the AUC values of the three models were compared, and the Relief-XGBoost model was considered the best overall; the reported AUC values were 0.964, 0.928 and 0.928. The empirical analysis in this thesis not only provides data support for medical researchers diagnosing breast cancer, but also broadens the range of methods used in medical diagnosis research. Moreover, it enriches research on combining life science with ensemble learning under limited resources.
Keywords/Search Tags: Breast cancer prediction, Variable selection, Ensemble learning, Data mining
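
The workflow described in the abstract (select a small subset of the 30 features, fit an ensemble classifier, then compare accuracy, recall on malignant cases and AUC) can be illustrated with a minimal sketch. It is not the thesis's code: it assumes scikit-learn and its bundled copy of the Wisconsin diagnostic data, shows only the Lasso plus random forest combination, and the 80/20 split, random seed, number of trees and the function name `run_lasso_random_forest` are all assumptions made for illustration. Its output should not be expected to match the figures reported above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def run_lasso_random_forest(n_features=8, seed=0):
    """Select features with Lasso, then fit a random forest on them.

    Hypothetical helper: the split, seed and hyperparameters are assumptions.
    """
    data = load_breast_cancer()  # scikit-learn's copy of the data (30 features)
    X_train, X_test, y_train, y_test = train_test_split(
        data.data, data.target, test_size=0.2,
        stratify=data.target, random_state=seed,
    )
    model = make_pipeline(
        StandardScaler(),
        # Keep the n_features largest absolute Lasso coefficients;
        # threshold=-inf makes max_features the only selection criterion.
        SelectFromModel(
            LassoCV(cv=5, max_iter=10000, random_state=seed),
            max_features=n_features, threshold=-np.inf,
        ),
        RandomForestClassifier(n_estimators=200, random_state=seed),
    )
    model.fit(X_train, y_train)

    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]
    mask = model.named_steps["selectfrommodel"].get_support()
    return {
        "selected": list(data.feature_names[mask]),
        "accuracy": accuracy_score(y_test, pred),
        # In this copy of the data, label 0 is malignant and 1 is benign.
        "malignant_recall": recall_score(y_test, pred, pos_label=0),
        "auc": roc_auc_score(y_test, proba),
    }


if __name__ == "__main__":
    for name, value in run_lasso_random_forest().items():
        print(name, value)
```

Reporting recall on the malignant class next to accuracy mirrors the distinction drawn in the conclusions: a model chosen for overall accuracy is not necessarily the one that finds the most malignant cases.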