
Study On Feature Selection And Classification Algorithm For Gene Expression Data

Posted on: 2016-01-03
Degree: Master
Type: Thesis
Country: China
Candidate: S S Wei
GTID: 2180330470969329
Subject: Computer application technology
Abstract/Summary:
With the development of genomics, DNA microarray technology provides new ideas and methods for solving problems in the life sciences. Gene expression data are obtained from DNA microarray hybridization experiments after data preprocessing. They are usually represented as a matrix characterized by high dimensionality, small sample size, and unbalanced distribution. Gene expression data can provide reliable classification results for the diagnosis and treatment of diseases. Feature selection on gene expression data reduces the dimensionality of the data and the cost of later biological analysis, and selecting the important genes can provide a more accurate basis for disease prevention and diagnosis. This thesis focuses on feature selection and classification algorithms for gene expression data. The main contents are as follows:
(1) A model-free gene selection method based on maximum mutual information (MMI) is presented. Maximum mutual information is used for preliminary screening of genes, which provides a good initialization environment for a genetic algorithm. The method transforms feature selection into a global optimization problem. It can remove a large amount of noise and effectively reduce redundant genes. The selected features can be used directly with other types of classifiers and yield high classification accuracy.
(2) A feature selection method based on a cloud platform is proposed. Combining the characteristics of cloud computing with the feature selection method, a simulated Hadoop cloud computing platform is built from 5 PCs. Each feature is assigned to a Map task, which computes the information entropy of that feature. The Reduce step sorts the resulting mutual information values. The screened features are then summarized and sent to the client, where an ELM is trained and tested to measure classification accuracy. The method speeds up feature selection and reduces its time cost.
(3) An improved regularized extreme learning machine (RELM) algorithm, based on the fish swarm optimization algorithm and Cholesky decomposition, is applied to the classification of gene expression data. The fish swarm optimization algorithm is used to optimize the input-layer weights, and Cholesky decomposition is applied to the computation of the RELM output-layer weight matrix to improve speed. The improved algorithm obtains high classification accuracy and good generalization performance.
(4) An improved RELM classification method is proposed to raise the classification accuracy on gene expression data. The algorithm uses an optimization based on Fibonacci sequence theory to improve the RELM hidden-layer nodes and biases. Experimental results show high classification accuracy on commonly used tumor data sets.
This thesis mainly studies feature selection and classification problems for gene expression data. The proposed algorithms were applied to the Breast, Colon, Leukemia, SRBCT and other data sets. They can enhance the classification accuracy of gene expression data and may become a valuable tool for research in biology and the life sciences.
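The mutual-information screening mentioned in items (1) and (2) can be illustrated with a minimal sketch. This is not the thesis's implementation: it is a plain filter that ranks genes by their mutual information with the class label, without the genetic algorithm or the Hadoop Map/Reduce stages. The function names (`entropy`, `mutual_information`, `discretize`, `rank_genes`), the equal-width binning, and the `bins` parameter are all assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(feature, labels):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) for two discrete sequences."""
    joint = list(zip(feature, labels))
    return entropy(feature) + entropy(labels) - entropy(joint)


def discretize(values, bins=3):
    """Equal-width binning (an assumed choice; the abstract does not specify one)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    # Clamp so that the maximum value falls into the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in values]


def rank_genes(expr, labels, bins=3):
    """Return gene (column) indices sorted by decreasing mutual information.

    expr is a list of samples, each a list of expression values per gene.
    Ties are broken by the lower gene index.
    """
    scores = []
    for g in range(len(expr[0])):
        column = [row[g] for row in expr]
        scores.append((mutual_information(discretize(column, bins), labels), g))
    return [g for _, g in sorted(scores, key=lambda s: (-s[0], s[1]))]
```

In a Map/Reduce setting like the one the abstract describes, the per-gene score computed inside the loop of `rank_genes` would be the natural unit of work for a Map task, and the final sort would correspond to the Reduce step. That mapping is an interpretation of the abstract, not something this sketch implements.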
Keywords/Search Tags: gene expression data, feature selection, classification, regularized extreme learning machine, maximum mutual information