| With the development of gene sequencing technology, gene expression data for breast cancer have become available, and the study of these data can provide a basis for early diagnosis. Time is critical in breast cancer diagnosis: the earlier the disease is found, the more effective treatment is likely to be, whereas a late diagnosis or a misdiagnosis may cost the patient the best opportunity for treatment. However, breast cancer gene expression data are high-dimensional, redundant, and strongly correlated. At the same time, the class distribution is imbalanced and the number of patient samples is small. Because of these problems, classification performance on breast cancer gene expression data has not yet reached a level suitable for practical application, so the effective analysis and processing of these data is of considerable research significance. Worldwide, breast cancer ranks first among malignant diseases in women. The risk of breast cancer among women in China has been increasing year by year, with a very negative impact on women's health. In China, women with breast cancer have a good chance of being cured, but many people lack the awareness to have regular physical examinations, and the early signs of the disease are not obvious and are easily overlooked. Early detection is therefore a serious problem, and when detection is delayed the tumor can easily spread and metastasize, making treatment more difficult. Early detection allows early treatment, which not only improves the survival rate but also reduces suffering, so screening for the characteristic genes that drive breast cancer is of great significance for fighting the disease and for developing new anti-breast-cancer drugs. To build a better ensemble 
learning model, this paper uses three methods to extract differentially expressed genes and selects 3053 differentially expressed genes, including 1391 up-regulated genes and 1644 down-regulated genes. Then the minimum-redundancy maximum-relevance (mRMR) algorithm is used to remove redundant features while preserving the relevance of the features to the class labels. The top 300 features are retained, and the selected features are then evaluated. The random forest algorithm is used to rank the features by importance. Undersampling is used to handle the class imbalance, leaving 228 samples after processing. The processed and filtered data are then modeled and analyzed. First, the traditional ensemble learner XGBoost, a powerful tree-based ensemble method, is evaluated: its accuracy is 92.31% and its AUC is 0.999. Then the traditional single learners, SVM and the neural network, are each combined into an ensemble: the bootstrap SVM algorithm reaches a classification accuracy of 97.44% with an AUC of 1, and the BP-AdaBoost algorithm reaches a classification accuracy of 98.588% with an AUC of 0.9875, so both ensemble models achieve very good results. To address the problems above, edgeR, DESeq2 and limma are used for a primary selection of differentially expressed genes, and the mRMR algorithm is then applied to the primary selection for a secondary selection. The mRMR algorithm not only further reduces the dimensionality of the differentially expressed genes, but also ensures minimum redundancy among genes and maximum relevance between genes and class labels. XGBoost, bootstrap SVM and BP-AdaBoost are then used to model and analyze the processed data. Modeling with the feature subset filtered in this way not only keeps the prediction model stable, but also gives it strong classification and prediction ability, and the prediction 
accuracy of the integrated classifier model is better than that of the single classifier model. |
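
The undersampling step and the minimum-redundancy feature selection step described above can be illustrated with a minimal sketch. It is not the paper's implementation, which is not given here. It assumes expression values have already been discretized (for example into low/high levels), uses mutual information as the measure of relevance and redundancy, and uses the "difference" form of the greedy criterion (relevance minus mean redundancy). The function names `random_undersample`, `mutual_information`, and `mrmr_select` and the toy data are hypothetical and chosen only for illustration.

```python
import math
import random
from collections import Counter


def random_undersample(labels, seed=0):
    """Return sorted sample indices with every class cut to the minority-class size."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    n_min = min(len(idxs) for idxs in by_class.values())
    kept = []
    for idxs in by_class.values():
        kept.extend(rng.sample(idxs, n_min))
    return sorted(kept)


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr_select(features, labels, k):
    """Greedy selection of k feature indices (assumed difference-form criterion):
    relevance I(f; y) minus mean redundancy with the features already selected."""
    relevance = [mutual_information(f, labels) for f in features]
    selected = []
    remaining = list(range(len(features)))
    while remaining and len(selected) < k:
        def score(j):
            if not selected:
                return relevance[j]
            redundancy = sum(
                mutual_information(features[j], features[s]) for s in selected
            ) / len(selected)
            return relevance[j] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    # Toy data: 8 samples, 4 hypothetical discretized genes (one list per gene).
    labels = [0, 0, 0, 0, 1, 1, 1, 1]
    features = [
        [0, 0, 0, 1, 1, 1, 1, 1],  # gene 0: informative
        [0, 0, 0, 1, 1, 1, 1, 1],  # gene 1: exact duplicate of gene 0
        [0, 0, 1, 0, 1, 1, 0, 1],  # gene 2: weaker, but adds new information
        [0, 1, 0, 1, 0, 1, 0, 1],  # gene 3: unrelated to the labels
    ]
    print(mrmr_select(features, labels, 2))  # prints [0, 2]
    print(len(random_undersample([0] * 6 + [1] * 2)))  # prints 4
```

In the toy example, ranking by relevance alone would pick genes 0 and 1 even though gene 1 duplicates gene 0. The redundancy penalty makes the sketch prefer gene 2 as the second feature, which is the behavior the abstract attributes to the secondary selection.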