Parallel Feature Selection And Ensemble Classification For Gene Expression Data

Posted on: 2019-04-09
Degree: Master
Type: Thesis
Country: China
Candidate: J Zhang
GTID: 2370330566984186
Subject: Computer Science and Technology
Abstract/Summary:
Bioinformatics research is in an era of data explosion. In recent years, technical progress in genomics, metabonomics, transcriptomics, and proteomics has given biologists more data with which to analyze organisms from various aspects. Abnormal life activities often lead to abnormal gene expression, which can be captured in gene expression data obtained through microarray technology. Analyzing gene expression data makes it possible to diagnose and identify the type of plant stress response and to reduce the impact of these stresses before the corresponding symptoms appear. Since gene selection is a crucial step towards effective classification on large-scale gene data, high-performance methods for gene selection and sample classification have become increasingly important.

Pathway is a collection of pathway maps that represent knowledge of molecular interaction, reaction, and relation networks. Pathway knowledge is used for gene pre-selection, with each Pathway unit corresponding to a gene subset, in order to improve the biological interpretability of the gene selection result. We employ an attribute reduction method based on the intersection neighborhood rough set to select significant genes in each gene subset. In the ensemble classification stage, a selective ensemble classification model combined with Affinity Propagation is proposed: the Affinity Propagation clustering algorithm partitions the base classifiers into clusters, and the base classifiers that are exemplars of the clusters are chosen to build the final ensemble classification model. Experimental results on Arabidopsis thaliana biotic and abiotic stress response datasets show that, compared with existing classical ensemble classification methods, the ensemble approach combined with Pathway increases classification accuracy by up to 12%, and that the selected genes are related to plant stress response.

To avoid removing genes in the pre-selection process that are potentially valuable for classification, this paper then removes the gene pre-selection stage and directly uses the intersection neighborhood rough set to select significant genes. To speed up gene selection, it puts forward a matrix calculation method for the intersection neighborhood rough set and a parallelized method for computing approximation sets. During gene selection, three gene significance measures are used as heuristic information to improve the diversity among the reduced gene subsets. In addition, the selective ensemble classification method combined with Affinity Propagation clustering is improved, and a novel dynamic selective ensemble model is presented. Experimental results on three Arabidopsis thaliana biotic and abiotic stress response datasets demonstrate that the proposed method obtains better classification performance than the ensemble method with gene pre-selection, and that using a variety of heuristic information improves the diversity among the base classifiers and thus yields better classification performance.
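
The abstract does not define the intersection neighborhood rough set or name its three significance measures, so the following is only a minimal sketch of the general idea of rough-set-based gene selection, not the thesis's method. It assumes that the neighborhood of a sample over a gene subset is the intersection of its per-gene neighborhoods of radius delta, and it uses the classical dependency degree (the share of samples whose neighborhood contains a single class) as the only heuristic. The function names, the toy data, and the value of delta are all hypothetical.

```python
def neighborhood(data, i, genes, delta):
    """Samples within delta of sample i on every gene in `genes`.

    Assumption: the neighborhood over a gene subset is the intersection
    of the per-gene neighborhoods.
    """
    return [j for j in range(len(data))
            if all(abs(data[i][g] - data[j][g]) <= delta for g in genes)]


def dependency(data, labels, genes, delta):
    """Fraction of samples whose neighborhood is pure in class label."""
    if not genes:
        return 0.0
    pure = sum(
        1 for i in range(len(data))
        if all(labels[j] == labels[i]
               for j in neighborhood(data, i, genes, delta))
    )
    return pure / len(data)


def greedy_reduct(data, labels, delta):
    """Forward greedy selection: add the gene with the largest dependency
    gain, and stop as soon as no gene improves the dependency."""
    selected, best = [], 0.0
    remaining = list(range(len(data[0])))
    while remaining:
        scores = [(dependency(data, labels, selected + [g], delta), g)
                  for g in remaining]
        top = max(s for s, _ in scores)
        if top <= best:
            break
        gene = next(g for s, g in scores if s == top)  # lowest index on ties
        selected.append(gene)
        remaining.remove(gene)
        best = top
    return selected, best


# Hypothetical toy data: 6 samples x 3 genes, two classes.
data = [
    [0.10, 0.50, 0.20],
    [0.15, 0.90, 0.80],
    [0.20, 0.10, 0.25],
    [0.80, 0.55, 0.30],
    [0.85, 0.95, 0.75],
    [0.90, 0.15, 0.85],
]
labels = [0, 0, 0, 1, 1, 1]

selected, score = greedy_reduct(data, labels, delta=0.2)
print(selected, score)
```

In this toy data only gene 0 separates the two classes, so the greedy search stops after selecting it. The neighborhood of each sample is computed independently of the others, which is the kind of step that a parallelized approximation-set computation could distribute across workers.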
Keywords/Search Tags: Intersection Neighborhood Rough Set, Affinity Propagation Clustering, Selective Ensemble, Gene Expression Data, Parallel Computing