Objective: Drawing on data mining classification algorithms and their evaluation theory, this study established a breast cancer risk prediction model, aiming to screen high-risk groups and to find the risk factors (features) that are highly correlated with breast cancer. Traditional diagnosis of breast cancer relies mainly on clinical palpation, imaging examination, magnetic resonance imaging and fine needle aspiration cytopathological examination, and patients can be misdiagnosed. With the help of data mining algorithm models, doctors can predict which patients are at risk of breast cancer before symptoms appear, thereby reducing prevalence and reducing the misdiagnoses and missed diagnoses caused by subjective factors. In addition, most real-world breast cancer data cannot be used directly for prediction; a series of data mining steps such as data preprocessing and feature selection is needed to achieve high-precision breast cancer risk prediction. After preprocessing and feature selection on the breast cancer dataset, logistic regression, random forest, multilayer perceptron and BP neural network algorithms are used in turn to construct breast cancer prediction models, and the effect of the models' input variables and parameters on the prediction results is studied. On this basis, the model parameters are tuned, and the algorithm models are compared and analyzed through a comprehensive evaluation mechanism. The experiment finally yields the optimal breast cancer risk prediction model and the risk factors that are strongly related to breast cancer, which can ultimately help doctors diagnose breast cancer patients early, reduce the clinical missed diagnosis rate, and allow patients to receive the most timely and effective treatment.
Methods: 1. Use SMOTE oversampling, outlier processing, data normalization and other methods to preprocess the data in
the breast cancer dataset. 2. After dimensionality reduction, the numerical features are scored jointly by the Pearson correlation coefficient, the distance correlation coefficient and the random forest feature importance score, and the result serves as the input of the breast cancer prediction model. 3. Use logistic regression, random forest, multilayer perceptron and BP neural network algorithms to build separate prediction models. 4. Use grid search, random search and learning curves to optimize the models, and use model evaluation metrics in a comprehensive analysis to find the optimal model.
Results: 1. This study obtained a good predictor of breast cancer. 2. In the experiments, every evaluation metric of the SMOTE-BP model reached 99.07% on the test samples, and its prediction performance was the best when compared with some results that other scholars have reported on this dataset in recent years.
Conclusion: 1. Breast cancer data is high-dimensional and redundant, and models trained on it suffer from problems such as high complexity and overfitting. This research focuses mainly on data preprocessing and feature selection; classifying the data after these two steps achieves a good classification effect. 2. Medical data suffers from high feature dimensionality, small sample size, strong redundancy and high correlation. Processing the data according to the modeling process in this paper basically solves these problems. 3. This paper has done extensive work on data preprocessing and feature selection. The optimized base models achieve excellent performance; in particular, the SMOTE-BP model attains a near-limit accuracy rate and can serve as a good reference for subsequent clinical research.
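
As an illustration of steps 1 and 2 of the Methods, the following is a minimal standard-library sketch of min-max normalization, SMOTE-style oversampling and Pearson-based feature scoring. It is not the study's implementation: the toy data, the function names (`min_max_normalize`, `smote`, `pearson`), the neighbour count `k`, and the choice to normalize before oversampling are all assumptions made for the example. Outlier processing, distance correlation and the random forest importance score are omitted.

```python
import math
import random


def min_max_normalize(rows):
    """Scale every column of `rows` (a list of equal-length lists) to [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    random minority sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position on the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


# Hypothetical toy data: two features per sample; label 1 is the minority class.
benign = [[1.0, 10.0], [2.0, 12.0], [1.5, 11.0],
          [2.5, 9.0], [1.2, 10.5], [2.2, 11.5]]
malignant = [[6.0, 10.0], [7.0, 12.0], [6.5, 9.5]]

scaled = min_max_normalize(benign + malignant)
benign_s, malignant_s = scaled[:len(benign)], scaled[len(benign):]

# Oversample the minority class until both classes have the same size.
malignant_bal = malignant_s + smote(malignant_s, len(benign_s) - len(malignant_s))

X = benign_s + malignant_bal
y = [0] * len(benign_s) + [1] * len(malignant_bal)

# Absolute Pearson correlation of each feature with the class label.
scores = [abs(pearson([row[j] for row in X], y)) for j in range(len(X[0]))]
print(scores)
```

In this toy data the first feature separates the two classes and the second does not, so the first receives the higher score. A real pipeline would more likely use a library implementation of SMOTE and would combine this score with the other two measures named in step 2.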