The Research On Feature Selection Algorithms Based On Information Theory

Posted on: 2019-12-19  Degree: Master  Type: Thesis
Country: China  Candidate: W J Ren  Full Text: PDF
GTID: 2428330566984197  Subject: Computer Science and Technology
Abstract/Summary:
High-dimensional, complex biological data contain a large amount of important information closely related to life and health, and this information can be effectively discovered with the aid of data mining. Data mining is increasingly integrated with bioinformatics, and feature selection techniques are the most widely used. Mining features related to life and health from large-scale biological data and revealing the inherent laws of living systems are therefore important tasks. Because an organism is a complex system, physiological and pathological changes are usually influenced by molecular interactions, so the dependencies among variables should not be ignored in biological data analysis.

To address this, an Interaction Gain – Recursive Feature Elimination (IG-RFE) method is proposed. It measures feature importance from both the relevance between each feature and the class label and the interaction among features. The relevance is calculated by symmetrical uncertainty, and the interaction is evaluated by interaction gain. In each loop, the features that are less important according to these two measures are removed from the current feature set, so the estimated feature weights tend to become more accurate as noisy features are eliminated. Experiments on eight public datasets showed that IG-RFE is superior to MIFS, mRMR, CMIM and ReliefF in accuracy and stability. Combining feature-class relevance with feature interaction can therefore better measure feature importance in biological data analysis.

The occurrence and development of a disease result from many molecules acting in coordination. Defining the synergistic network and identifying module biomarkers is therefore of great significance for the diagnosis and prognosis of many diseases. To this end, an Interaction Gain – Network (IG-Net) method is proposed to discover synergistic module biomarkers. IG-Net employs interaction gain to detect synergism between features and to construct the synergistic network. It then searches the network, using a greedy strategy, for feature modules with the largest joint mutual information. To keep the search from falling into a local optimum, it borrows from the simulated annealing algorithm: the current feature module accepts an adjacent feature with a certain probability even if that feature does not improve the joint mutual information. Finally, a sequence of feature modules ranked by joint mutual information and average mutual information is obtained. Experimental results on public datasets showed that IG-Net can effectively identify information-rich, synergistic feature modules and that its performance is superior to MIFS, mRMR, CMIM and ReliefF.

Both IG-RFE and IG-Net select features by considering interaction or synergism among features. The experimental results on public datasets demonstrated the effectiveness of the two methods. Taking feature interaction or feature synergism into consideration therefore makes a difference in biological data analysis.
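The two measures named in the abstract can be illustrated with a small sketch for discrete features. The abstract does not state the formulas, so this assumes the commonly used definitions: symmetrical uncertainty SU(X, C) = 2 I(X; C) / (H(X) + H(C)) and interaction gain IG(X; Y; C) = I(X, Y; C) - I(X; C) - I(Y; C). The function names are hypothetical, and how the thesis combines the two values into a feature weight is not shown here.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Shannon entropy (in bits) of the joint distribution of discrete columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((k / n) * log2(k / n) for k in Counter(rows).values())


def mutual_information(x, y):
    """I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)


def symmetrical_uncertainty(x, c):
    """Assumed definition: SU(X, C) = 2 I(X; C) / (H(X) + H(C)), in [0, 1]."""
    denom = entropy(x) + entropy(c)
    return 2 * mutual_information(x, c) / denom if denom > 0 else 0.0


def interaction_gain(x, y, c):
    """Assumed definition: IG(X; Y; C) = I(X, Y; C) - I(X; C) - I(Y; C).

    Positive values indicate synergy (the pair tells more about C than the
    two features do separately); negative values indicate redundancy.
    """
    pair = list(zip(x, y))  # treat (X, Y) as a single joint variable
    return (mutual_information(pair, c)
            - mutual_information(x, c)
            - mutual_information(y, c))


# Toy data: the class is the XOR of two binary features.
x = [0, 0, 1, 1]
y = [0, 1, 0, 1]
c = [0, 1, 1, 0]

su_x = symmetrical_uncertainty(x, c)   # each feature alone is uninformative
ig_xy = interaction_gain(x, y, c)      # together they determine the class
ig_dup = interaction_gain(x, x, x)     # a duplicated feature is redundant
```

In the XOR toy data each feature has zero relevance to the class on its own, yet the pair determines it completely. This is the kind of dependency that a relevance-only ranking would miss and that an interaction term is meant to capture.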
Keywords/Search Tags:Bioinformatics, Data Mining, Feature Selection, Feature Interaction