
The Analysis Of Tumor Gene Expression Profile Data Based On Hybrid Feature Selection Algorithm

Posted on: 2021-05-16
Degree: Master
Type: Thesis
Country: China
Candidate: Q Z Zhang
GTID: 2404330614453555
Subject: Statistics
Abstract/Summary:
As the Human Genome Project has advanced, massive amounts of biological data have been generated, which has in turn driven the development of DNA chip technology. Tumor gene expression profile data, a product of DNA chip technology, are a valuable resource for studying tumors. Such data are characterized by small sample sizes, high dimensionality, high noise, and high redundancy, which easily lead to the "curse of dimensionality" and to overfitting, and therefore pose great challenges for data processing. Feature selection, i.e., selecting the optimal feature subset from the original feature set, is an effective way to address these challenges. However, conventional feature selection algorithms are far from meeting this need, so more efficient feature selection algorithms are particularly important.

The minimum redundancy maximum relevance (mRMR) algorithm, a feature selection algorithm commonly applied to tumor gene expression profile data, seeks the feature subset of a given feature set that has the greatest correlation with the class label and the least redundancy among its features. However, when the dimension of the given feature set is large, the algorithm is time-consuming. To address this drawback, this thesis proposes an improved mRMR algorithm, the mRMR-ChiMIC algorithm, in which the mutual information (MI) that mRMR uses to measure correlation and redundancy is replaced by the maximal information coefficient (MIC).

Feature selection algorithms fall into several categories, each with its own advantages and disadvantages. As a typical filter method, the mRMR-ChiMIC algorithm, like most filter methods, often cannot automatically determine the size of the optimal feature subset. To select the optimal feature subset more efficiently, this thesis combines the advantages of filter and wrapper methods by coupling the mRMR-ChiMIC algorithm with the Boruta algorithm to form a hybrid feature selection algorithm. The algorithm has two stages: first, the mRMR-ChiMIC algorithm is used to find a candidate feature set, quickly filtering out irrelevant and redundant features; then the Boruta algorithm is used to select the optimal feature subset from the candidate set. Experiments were carried out on three commonly used tumor gene expression profile data sets: DLBCL, Prostate, and Leukemia. The results show that, compared with the mRMR and SRCMRMR algorithms, the proposed hybrid algorithm achieves higher classification accuracy with a smaller optimal feature subset.
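
The relevance-versus-redundancy trade-off behind mRMR can be illustrated with a minimal sketch. It assumes already discretized features, estimates mutual information from empirical counts, and uses the common "relevance minus mean redundancy" greedy criterion. It is plain MI-based mRMR, not the thesis's mRMR-ChiMIC variant (which replaces MI with MIC), and it does not include the Boruta stage. The function names, the toy gene names, and the four-class labels are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # Sum over observed pairs of p(x, y) * log(p(x, y) / (p(x) * p(y))).
    return sum((c / n) * log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def mrmr(features, labels, k):
    """Greedy mRMR (difference criterion) over a dict of name -> discrete column.

    Assumption: each step picks the feature maximizing
    relevance(feature, labels) - mean redundancy(feature, already selected).
    Ties are broken by feature name so the result is deterministic.
    """
    relevance = {name: mutual_information(col, labels) for name, col in features.items()}
    selected, remaining = [], set(features)

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < k:
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: 8 samples, 4 classes, 4 discretized "genes".
labels = [0, 0, 1, 1, 2, 2, 3, 3]
features = {
    "gene_a": [0, 0, 0, 0, 1, 1, 1, 1],       # informative
    "gene_a_copy": [0, 0, 0, 0, 1, 1, 1, 1],  # informative but redundant with gene_a
    "gene_b": [0, 0, 1, 1, 0, 0, 1, 1],       # informative and complementary to gene_a
    "gene_noise": [0, 1, 0, 1, 0, 1, 0, 1],   # independent of the labels
}

selected = mrmr(features, labels, k=2)
print(selected)
```

On this toy data the duplicate column is exactly as relevant as the column it copies, but it is passed over in the second step because its redundancy with the already selected feature cancels its relevance. Note that the target size k must be supplied by the caller, which reflects the limitation of filter methods noted in the abstract.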
Keywords/Search Tags: Tumor gene expression profile data, Feature selection, mRMR-ChiMIC algorithm, Boruta algorithm, Hybrid feature selection algorithm