Research On Feature Selection Algorithm Based On Breast Cancer Gene Expression Data

Posted on: 2020-11-28
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zhang
GTID: 2404330599956765
Subject: Computer software and theory

Abstract/Summary:
In recent years, with morbidity and mortality continuing to rise, cancer has become one of the major factors affecting human health. Breast cancer, the most common malignant tumor among women, seriously endangers women's health. Given current medical equipment and technology, early diagnosis and early treatment are the key means of treating breast cancer. With the continuous development of machine learning, machine learning algorithms can detect cancer risk in a simpler and more effective way, thereby reducing the incidence of cancer. Gene detection methods are also developing steadily against this background. Because the occurrence and development of tumors are closely related to genes, using gene expression data for early diagnosis is of great significance for the discovery and identification of breast cancer. Using machine learning algorithms to select and classify gene expression data in order to predict cancer has become a hot topic in the field of cancer classification.

Gene expression data is high-dimensional, and high-dimensional feature sets contain a large amount of data unrelated to cancer. It is therefore necessary to use feature selection to screen out a set of feature genes related to breast cancer. Traditional feature selection methods, such as the chi-square test, decision trees and information gain, often suffer from ineffective removal of redundant features, high time complexity and over-fitting. Choosing an appropriate feature selection method is therefore the key issue of this thesis. Building on research by domestic and foreign scholars on feature gene selection for breast cancer, three new feature selection algorithms are proposed: Ave-mRMR, RFFS-GS and SVM-RFE-PO. Subsets of feature genes related to breast cancer are screened with these three methods, and the selected optimal feature subsets are fed to a support vector machine (SVM) classifier and to a Bootstrap-SVM ensemble classifier for breast cancer classification, in order to identify the most effective feature selection algorithm. The main research work covers the following four aspects:

(1) Based on the traditional mRMR feature selection algorithm, the Ave-mRMR algorithm is proposed, which incorporates the idea of mutual information standardization. The algorithm ensures maximum correlation between features and categories while removing redundant features, and balances the correlation and redundancy between features. Both algorithms (mRMR and Ave-mRMR) were used to select feature genes on the DNA microarray dataset and the RNA-seq gene expression dataset, and the selected optimal feature gene subsets were used for breast cancer classification. The experimental results show that the improved algorithm, Ave-mRMR, selects genes related to breast cancer more accurately.

(2) Based on the random-forest-based feature selection algorithm RFFS, an improved model with parameter optimization, RFFS-GS, is proposed. The model applies the grid optimization algorithm to the parameter optimization process of RFFS: the grid optimization algorithm first searches for the optimal parameter values, which are then used to construct the random forest in the RFFS algorithm. The result is a more accurate and effective feature selection algorithm, RFFS-GS. Both algorithms (RFFS and RFFS-GS) were used to select feature genes on the DNA microarray dataset and the RNA-seq gene expression dataset, and the resulting optimal feature gene subsets were used for breast cancer classification. The results show that the improved RFFS-GS algorithm selects feature genes more efficiently.

(3) Based on the SVM-based recursive feature elimination algorithm SVM-RFE, a parameter-optimized variant, SVM-RFE-PO, is proposed, which combines SVM-based recursive feature elimination with a parameter optimization algorithm. Applying the grid search algorithm (GS), particle swarm optimization (PSO) and the genetic algorithm (GA) to search for optimal parameter values during feature selection yields three new feature selection methods: SVM-based recursive feature elimination with grid search (SVM-RFE-GS), with particle swarm optimization (SVM-RFE-PSO), and with the genetic algorithm (SVM-RFE-GA). These three algorithms are collectively referred to as SVM-RFE-PO. All four algorithms (SVM-RFE and its three variants) were used to select feature genes on the DNA microarray dataset and the RNA-seq gene expression dataset, and the resulting best feature gene subsets were used for breast cancer classification. The experimental results show that the SVM-RFE-PSO algorithm selects feature genes most effectively.

(4) A Bootstrap-SVM ensemble classifier model based on the Bagging algorithm is proposed. The model draws different subsets of the training set by Bootstrap sampling and trains a different base classifier on each subset. A combination strategy then integrates the weak classifiers produced by training into a strong classifier, Bootstrap-SVM. Experiments show that the classification accuracy of the Bootstrap-SVM ensemble classifier is higher than that of a single SVM classifier.

In this thesis, existing feature selection algorithms are improved with different parameter optimization methods. Judging by the classification performance of the feature subsets obtained by the different feature selection models, the improved algorithms select features more effectively. Parameter optimization for feature selection algorithms is therefore a topic of real significance.
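
Contribution (1) builds on mRMR, which greedily adds the feature with the highest relevance to the class label minus its average redundancy with the features already chosen. The following is a minimal sketch of plain mRMR only, assuming discretised expression values; it is not the thesis's Ave-mRMR, whose mutual information standardization is not specified in this abstract. The gene names, the toy data and the four-class label are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written in terms of raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def mrmr_select(features, labels, k):
    """Greedy mRMR (difference form). `features` maps name -> list of discrete values."""
    relevance = {name: mutual_information(col, labels) for name, col in features.items()}
    selected = []
    remaining = set(features)
    while remaining and len(selected) < k:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        # sorted() makes tie-breaking deterministic (alphabetical order)
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: 8 samples, 4 classes, expression discretised to 0/1.
labels = [0, 0, 1, 1, 2, 2, 3, 3]
features = {
    "gene_a": [0, 0, 0, 0, 1, 1, 1, 1],       # informative
    "gene_a_copy": [0, 0, 0, 0, 1, 1, 1, 1],  # exact duplicate of gene_a (redundant)
    "gene_c": [0, 0, 1, 1, 0, 0, 1, 1],       # informative and independent of gene_a
}
selected = mrmr_select(features, labels, k=2)
print(selected)
```

In this toy case all three genes are equally relevant on their own, but once one copy of `gene_a` is chosen the duplicate has a redundancy equal to its relevance, so the second slot goes to `gene_c`.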
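
Contributions (2) and (3) both rely on grid search to pick parameter values before feature selection is run. As a rough illustration, grid search can be sketched as an exhaustive loop over the Cartesian product of candidate values, keeping the best-scoring combination. The parameter names `n_trees` and `max_features` and the scoring function below are assumptions made for the example; the abstract does not state which parameters were tuned, and in practice the score would be something like cross-validated classification accuracy.

```python
import itertools


def grid_search(param_grid, score_fn):
    """Return (best_params, best_score) over every combination in param_grid."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:  # strict ">" keeps the first of any tied candidates
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    # Hypothetical stand-in for cross-validated accuracy;
    # it peaks at n_trees=200, max_features=20.
    return (
        1.0
        - abs(params["n_trees"] - 200) / 1000
        - abs(params["max_features"] - 20) / 100
    )


grid = {"n_trees": [50, 100, 200, 400], "max_features": [5, 10, 20, 40]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

The cost grows with the product of the grid sizes, which is one reason heuristic searches such as PSO and GA are considered alongside grid search in contribution (3).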
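
The Bagging idea behind contribution (4) can be sketched as follows: draw bootstrap samples (sampling with replacement) from the training set, fit one base classifier per sample, and combine their predictions. Two assumptions are made here and should be read as such: the combination strategy is a simple majority vote, since the abstract only says "a certain strategy", and the base learner is a nearest-centroid classifier used as a dependency-free stand-in, not an SVM. All data is hypothetical.

```python
import random
from collections import Counter


def fit_centroid(X, y):
    """Stand-in base learner (NOT an SVM): predict the class of the nearest centroid."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}

    def predict(row):
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])),
        )

    return predict


def fit_bagging(X, y, fit_base, n_estimators=11, seed=0):
    """Train n_estimators base models on bootstrap samples; predict by majority vote."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # n draws with replacement
        models.append(fit_base([X[i] for i in idx], [y[i] for i in idx]))

    def predict(row):
        votes = Counter(model(row) for model in models)
        return votes.most_common(1)[0][0]

    return predict


# Hypothetical toy data: two well-separated clusters in two dimensions.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
y = [0, 0, 0, 0, 1, 1, 1, 1]
ensemble = fit_bagging(X, y, fit_centroid)
print([ensemble(row) for row in X])
```

To move closer to the Bootstrap-SVM described above, `fit_centroid` would be replaced by a function that trains an SVM on the sample and returns its prediction function; the bootstrap and voting logic would stay the same.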
Keywords/Search Tags:Gene expression data, Ave-mRMR algorithm, RFFS-GS algorithm, SVM-RFE-PO algorithm, Bootstrap-SVM model