
Research On Feature Selection And Feature Construction Of High-Dimensional Data Based On Genetic Programming

Posted on: 2021-01-02
Degree: Master
Type: Thesis
Country: China
Candidate: P S Liu
GTID: 2504306305471374
Subject: Master of Engineering
Abstract/Summary:
Cancer's high incidence and mortality rates make it a leading cause of death. Gene sequencing provides a technical means for detecting abnormal genes early and, combined with machine learning, for building cancer prediction models. Gene data generated by gene chip technology is high-dimensional, has few samples, and is noisy, all of which make analysis harder. In classification, feature quality is closely tied to classification performance; in high-dimensional gene data in particular, the large number of redundant and irrelevant features seriously degrades it. Feature processing can reduce the dimensionality of gene data, eliminate genes unrelated to pathogenicity, and improve classification accuracy. Feature selection reduces the number of features and improves accuracy by keeping high-quality features, but the original features alone sometimes cannot achieve the expected effect. In that case, feature construction can build new, more effective features, and the constructed features often classify better. Genetic programming (GP) suits feature construction because of its flexible representation. In high-dimensional classification, however, the huge search space challenges GP's search ability. This thesis therefore studies the combination of feature selection and feature construction in order to improve classification performance on high-dimensional data such as gene data. The main work is as follows:

(1) A hybrid feature selection and feature construction method (LFSFC) is proposed. It processes the original features of gene data in two stages: first, a feature selection method based on linear forward selection (LFS) prunes the features; then GP constructs higher-level features from the original features to improve the predictive ability of cancer classification.
(2) In both feature selection and feature construction, a correlation-based fitness function is used to increase the correlation between features and classes and to reduce the correlation among features.
(3) Eight high-dimensional microarray data sets were processed with the proposed method and tested with eight classifiers: K-Nearest Neighbor, Naive Bayes, C4.5, Naive Bayes Tree, Best First Tree, Reduced Error Pruning Tree, Random Tree, and Random Forest.
(4) The experimental results were analyzed with the Friedman significance test, and the classification performance of LFSFC was compared with two benchmark methods and a further three benchmark methods on different classifiers.

The experimental results show that LFSFC needs to construct only 20 features, about 0.02% of the original number of features on most data sets, and that it can greatly reduce the feature dimension of gene data while improving the classification accuracy for cancer. According to the Friedman significance test, LFSFC is superior to the two benchmark methods and the three benchmark methods across different classifiers. Further analysis concludes that decision tree classifiers are better suited to LFSFC.
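The abstract does not give the formula for its correlation-based fitness function. As an illustration only, the sketch below assumes a CFS-style merit (Hall's correlation-based feature selection heuristic), which rewards feature-class correlation and penalizes feature-feature correlation. The function names `pearson` and `cfs_merit`, and the use of Pearson correlation, are assumptions made for this example and are not taken from the thesis.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Returns 0.0 when either sequence is constant (correlation undefined).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def cfs_merit(features, labels):
    """CFS-style merit of a feature subset (assumed stand-in fitness).

    merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff), where
    r_cf is the mean absolute feature-class correlation and
    r_ff is the mean absolute feature-feature correlation.
    `features` is a list of columns, one list of values per feature.
    """
    k = len(features)
    if k == 0:
        return 0.0
    r_cf = sum(abs(pearson(f, labels)) for f in features) / k
    if k == 1:
        return r_cf  # no feature pairs, so no redundancy penalty
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    r_ff = sum(abs(pearson(features[i], features[j])) for i, j in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


# Toy data: two classes, four samples (hypothetical values).
labels = [0, 0, 1, 1]
f1 = [1.0, 2.0, 3.0, 4.0]
g = [2.0, 1.0, 4.0, 3.0]  # as predictive as f1, but only partly correlated with it

redundant = cfs_merit([f1, f1], labels)       # adding a duplicate feature
complementary = cfs_merit([f1, g], labels)    # adding a complementary feature
print(redundant < complementary)
```

Under this assumed merit, duplicating a feature leaves the score unchanged, while adding an equally predictive but less correlated feature raises it. That is the behavior a fitness function needs in order to steer both selection and construction away from redundant features.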
Keywords/Search Tags:feature selection, feature construction, genetic programming, classification, high dimension, gene data