According to the "2020 Global Cancer Report", breast cancer ranked first in global cancer incidence in 2020, overtaking lung cancer as the most commonly diagnosed cancer worldwide, and it remains a leading cause of cancer death. To draw more attention to the increasing prevalence of breast cancer, the World Health Organization launched the Global Breast Cancer Initiative in March 2021, which aims to reduce breast cancer mortality by 2.5% per year through 2040. For breast cancer, early detection is the key to improving prognosis and increasing the survival rate. In addition, a good prognostic analysis can help patients greatly extend survival time and improve quality of life. This research therefore focuses on the early diagnosis and prognostic analysis of breast cancer. It applies machine learning, statistical analysis, and data mining techniques, and builds models on existing clinical data to achieve accurate diagnosis of breast cancer and prediction of prognostic survival.

For early diagnosis, this paper proposes a breast cancer clinical data analysis and processing model based on an improved random forest optimization algorithm. First, collinearity analysis and model selection are performed on the acquired breast cancer clinical data, with factor analysis used to interpret the variables and to analyze the collinearity among them. For model selection, several algorithms are reproduced and their prediction accuracy is compared to assess how well each one works as a diagnostic classifier on the breast cancer data set; the random forest, which performs best, is selected as the diagnostic model. Second, the SelectKBest algorithm is used to screen the features and reduce the complex interrelationships between them, and a genetic algorithm is used to optimize the parameter selection of the random forest classifier, evaluating candidate parameter sets by evolutionary search so as to maximize classification accuracy. The evaluation metrics in this study are precision, recall, F1 score, and AUC. Experiments show that all of these metrics improve after the method is optimized. The proposed method provides a new approach to clinical data processing and disease prediction for data with strong collinearity.

For prognostic analysis, this paper proposes a triple screening method for survival-related breast cancer genes based on the Cox proportional hazards regression model and the Kaplan-Meier model. First, a breast cancer data set is obtained from the cBioPortal database, including gene expression, copy number, and clinical data. After preprocessing, the data set is filtered with the proposed method, which identifies nine breast cancer-related genes that are significantly associated with patient survival. These nine genes are then used as features to build predictive models of patients' two-year survival using a decision tree, logistic regression, a neural network, and XGBoost. The results show that all four machine learning models predict the two-year survival of breast cancer patients well, and that the nine genes are reliable indicators of patient survival. The nine breast cancer-related genes obtained by triple screening are therefore strongly correlated with the survival time of breast cancer patients and can be used to predict their prognostic survival accurately.
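
As a rough illustration of the Kaplan-Meier step mentioned in the prognostic analysis, the following Python sketch computes the product-limit survival estimate from follow-up times and event indicators. The function name `kaplan_meier` and the toy follow-up data are hypothetical and are not taken from the study; the screening described above would additionally involve Cox regression and a comparison between gene-expression groups, neither of which is shown here.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time per patient (same unit for all patients).
    events: 1 if death was observed at that time, 0 if the patient was censored.
    """
    if len(durations) != len(events):
        raise ValueError("durations and events must have the same length")

    # Number of observed deaths at each time point.
    deaths = Counter(t for t, e in zip(durations, events) if e)
    # Number of patients leaving the risk set at each time point
    # (either through death or through censoring).
    leaving = Counter(durations)

    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            # Product-limit update: S(t) = S(t-) * (1 - d_t / n_t).
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


if __name__ == "__main__":
    # Hypothetical follow-up times in months; 1 = death observed, 0 = censored.
    months = [6, 7, 10, 15, 19, 25]
    observed = [1, 0, 1, 1, 0, 1]
    for t, s in kaplan_meier(months, observed):
        print(f"t={t:>2}  S(t)={s:.3f}")
```

In a gene screening setting, one plausible use (an assumption, not the paper's stated procedure) is to split patients into high-expression and low-expression groups for a candidate gene, estimate one curve per group, and keep the gene only if the two curves differ significantly.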