Research On The Gene Selection Based On Neighborhood Mutual Information

Posted on: 2015-08-18
Degree: Master
Type: Thesis
Country: China
Candidate: T H Xu
GTID: 2298330431990596
Subject: Computer application technology
Abstract/Summary:
Rough set theory is an effective tool for data analysis that can handle uncertain, imprecise, incomplete, and inconsistent data. In recent years it has been applied in bioinformatics and has achieved good results in feature gene selection for tumor classification. However, classical rough set theory is defined on equivalence relations and can therefore only handle discrete data, so numerical features must be discretized before it can be applied. Discretization loses important information and reduces classification accuracy. Neighborhood rough set theory, by contrast, handles numerical data directly and can be applied to tumor feature gene selection without discretization, which saves considerable preprocessing time and avoids some of the information loss. The feature gene subsets it yields largely retain the classification ability of the original data set. This paper proposes several feature gene selection algorithms that apply neighborhood rough set theory and use neighborhood mutual information as the measure of correlation among features.
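The abstract does not state the formula for neighborhood mutual information. As a minimal sketch, assuming the commonly cited definition NMI(R; S) = -(1/n) Σ log(|δ_R(x_i)| · |δ_S(x_i)| / (n · |δ_{R∪S}(x_i)|)), where δ(x_i) is the set of samples within distance `delta` of x_i, the measure can be computed as follows. The function names, the default `delta`, and the use of the Chebyshev distance (which makes the joint neighborhood equal to the intersection of the two single-feature neighborhoods) are illustrative assumptions, not details taken from the thesis.

```python
import numpy as np

def neighborhood_mask(x, delta):
    """Boolean n-by-n matrix: entry (i, j) is True when sample j is within delta of sample i."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    # Chebyshev (infinity-norm) distance between every pair of samples (assumed metric)
    dist = np.abs(x[:, None, :] - x[None, :, :]).max(axis=2)
    return dist <= delta

def neighborhood_mutual_information(r, s, delta=0.1):
    """Neighborhood mutual information between two numerical features r and s."""
    n = len(r)
    in_r = neighborhood_mask(r, delta)
    in_s = neighborhood_mask(s, delta)
    # Under the Chebyshev distance, the joint neighborhood is the intersection
    in_joint = in_r & in_s
    size_r = in_r.sum(axis=1)
    size_s = in_s.sum(axis=1)
    size_joint = in_joint.sum(axis=1)  # always >= 1: each sample is its own neighbor
    return float(-np.mean(np.log(size_r * size_s / (n * size_joint))))
```

Under this definition, a feature paired with a constant feature scores zero, and a feature paired with itself scores its own neighborhood entropy, so no discretization step is needed to compare numerical gene expression values.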
The main innovations of this paper are as follows:

(1) Applying rough set theory to numerical features requires discretization, which loses important information and reduces classification accuracy. To address this, an improved Relief feature selection algorithm based on neighborhood mutual information (NRFE_Relief) is proposed and used to rank genes and generate candidate feature gene subsets. The neighborhood rough set reduction model is then used to reduce the candidate subsets; it processes numerical features directly, without discretization, and yields the relevant feature gene subsets. These relevant subsets are then evaluated by particle swarm optimization, which searches for optimal or near-optimal feature subsets. On this basis, a novel feature gene selection algorithm based on neighborhood rough set and particle swarm optimization is proposed. The experimental results show that this algorithm selects informative cancer genes quickly and effectively and obtains better classification results.

(2) To reduce the influence of tumor-unrelated genes and noise, a new feature gene selection method based on neighborhood mutual information and the self-organizing map is proposed. The method uses the NRFE_Relief algorithm to rank genes and generate candidate feature gene subsets. The self-organizing map algorithm is improved by replacing Euclidean distance with neighborhood mutual information as the measure of correlation between attributes, and the improved algorithm is used to cluster the candidate feature gene subsets. An attribute importance coefficient is defined based on neighborhood mutual information, and the representative gene of each cluster is selected according to it to form the feature subset. The experimental results show that the method selects informative cancer genes quickly and effectively and improves classification precision.

(3) To overcome the hard partitioning of the K-means algorithm, to remedy the tendency of the fuzzy C-means clustering algorithm to converge to local optima and produce poor clusterings, and to better handle numerical gene data, the cohesion degree of the neighborhood of an attribute and the coupling degree between neighborhoods of attributes are defined based on neighborhood mutual information, and a new method for initializing cluster centers is built on them. The fuzzy C-means algorithm is improved with this initialization method and used to cluster the gene data. Based on the attribute importance coefficient, the gene with the largest coefficient in each cluster is selected as its representative. On this basis, a novel feature gene selection algorithm based on neighborhood rough set and the fuzzy C-means algorithm is proposed. The experimental results show that this algorithm selects feature gene subsets quickly and effectively.
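The final step shared by the two clustering-based methods, picking one representative gene per cluster, can be sketched as below. The abstract does not give the formula for the attribute importance coefficient, so this sketch takes precomputed scores as input; `select_representatives`, `cluster_labels`, and `importance` are hypothetical names used only for illustration.

```python
import numpy as np

def select_representatives(cluster_labels, importance):
    """Return, for each cluster, the index of the gene with the largest importance score."""
    cluster_labels = np.asarray(cluster_labels)
    importance = np.asarray(importance, dtype=float)  # hypothetical precomputed scores
    chosen = []
    for label in np.unique(cluster_labels):
        # Indices of the genes assigned to this cluster
        members = np.flatnonzero(cluster_labels == label)
        # Keep the member with the highest score (first one wins on ties)
        chosen.append(int(members[np.argmax(importance[members])]))
    return chosen
```

The returned indices form the selected feature gene subset, one gene per cluster, so the subset size equals the number of clusters produced by the clustering step.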
Keywords/Search Tags: Neighborhood mutual information, Relief algorithm, Self-organizing map, Particle swarm optimization, Fuzzy C means algorithm, Feature gene selection