Research On Feature Selection For Gene Expression Data

Posted on: 2012-06-05
Degree: Master
Type: Thesis
Country: China
Candidate: Q P Zhu
Full Text: PDF
GTID: 2120330338494805
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
Gene microarray technology is a new and highly influential technique in molecular biology. Microarrays make it feasible to obtain large amounts of gene expression data, so that researchers can understand gene expression patterns at the molecular level and study biological phenomena from a microscopic perspective. However, such datasets have several difficult traits: small sample sizes, high dimensionality, heavy noise, a large number of redundant genes, and uneven distribution. Choosing an appropriate method to reduce the feature dimension and select representative genes is therefore an important preprocessing step.

Gene expression data are small in sample size, unevenly distributed, noisy, and do not follow a normal distribution. This thesis proposes two estimators based on the theory of robust statistics. The two statistics not only take the information of the whole sample into account, but also avoid over-dependence on the normal model assumption. Experiments show that better classification accuracy is obtained when these estimators are applied in the t-statistic algorithm to select differentially expressed genes.

The support vector machine (SVM) is a classification technique based on structural risk minimization, and the L-J algorithm is a feature selection algorithm built on SVM classification. According to K-L transform theory, any vector can be expressed as the sum of its components in an orthogonal space. The improved algorithm therefore uses the components of the separating hyperplane's gradient vector along each axis instead of computing the angle between the gradient vector and each axis. The method achieves the same effect as the L-J algorithm.

Gene expression data contain many redundant genes, and a large number of redundant genes degrades classification results. The thesis proposes a method that maps each gene to a vector in feature space based on correlation coefficient theory and clusters the vectors according to certain rules. A representative subset is then selected from the clustered vectors to form the feature subset. Experiments show that the algorithm reduces the feature dimension and improves the classification results.

The genetic algorithm (GA) is an intelligent search algorithm for large data sets. This thesis proposes an improved genetic algorithm for feature selection that takes full account of the characteristics of gene expression data. The algorithm combines a genetic algorithm, an immune algorithm, filtering, a heuristic method, and support vector machine classification. The feature subset obtained by this algorithm has stronger classification ability.
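To make the robust-statistic idea from the abstract concrete, here is a minimal Python sketch of a t-like gene score. The abstract does not say which two estimators the thesis uses, so the choice here is an assumption made only for illustration: the median replaces the mean, and the scaled median absolute deviation (MAD) replaces the standard deviation. The function names `robust_t_scores` and `select_top_genes` are hypothetical.

```python
import numpy as np

def robust_t_scores(X, y):
    """Per-gene robust t-like score for a two-class problem.

    X: array of shape (n_samples, n_genes); y: binary labels (0 or 1).
    Assumption: median and scaled MAD stand in for the thesis's
    unspecified robust estimators.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    a, b = X[y == 0], X[y == 1]

    # Robust location: per-gene median within each class.
    med_a, med_b = np.median(a, axis=0), np.median(b, axis=0)

    # Robust scale: MAD times 1.4826, which matches the standard
    # deviation when the data happen to be normal.
    mad_a = 1.4826 * np.median(np.abs(a - med_a), axis=0)
    mad_b = 1.4826 * np.median(np.abs(b - med_b), axis=0)

    # Same form as the two-sample t-statistic denominator.
    denom = np.sqrt(mad_a**2 / len(a) + mad_b**2 / len(b))
    return (med_a - med_b) / (denom + 1e-12)  # small constant avoids 0/0

def select_top_genes(X, y, k):
    """Indices of the k genes with the largest absolute score."""
    scores = np.abs(robust_t_scores(X, y))
    return np.argsort(scores)[::-1][:k]
```

A call such as `select_top_genes(X, y, 50)` would return the indices of the 50 highest-ranked genes, which could then be passed to a classifier. Because the score relies on medians, a single extreme expression value in one sample has little effect on the ranking.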
Keywords/Search Tags: microarray gene dataset, feature selection, robust statistic, support vector machine (SVM), clustering, genetic algorithm (GA)