Exploration Of Multi-label Classification In Bioinformatics

Posted on: 2015-11-28  Degree: Master  Type: Thesis
Country: China  Candidate: T Zhang  Full Text: PDF
GTID: 2180330422489346  Subject: Systems analysis and integration
Abstract/Summary:
Observing the phenotype that results from a gene's overexpression or subcellular location is a basic method of investigating gene function. Many advanced biotechnologies, such as RNAi, have been developed to study gene phenotypes, but they still have many limitations. Besides time and cost, the knockdown of some genes may be lethal, which makes the observation of other phenotypes impossible. Computational methods that help study these issues are therefore essential.

These issues are multi-label classification problems. Traditional methods for solving multi-label problems, such as BR (Binary Relevance) and RPC (Ranking by Pairwise Comparison), are often based on data decomposition: they transform the multi-label problem into multiple single-label problems. This approach has some value. However, another improved algorithm, which regards the instances and their multiple labels as a whole network, has proved to perform better in both prediction accuracy and time cost. Based on this idea, this thesis proposes an improved kNN (k Nearest Neighbor) algorithm and applies it to the prediction of yeast gene phenotypes and of subcellular localization, achieving good performance. The first prediction accuracy reached 62.38% and 66%. Compared with three RPC-based algorithms (SMO, RandomForest, Bagging) on gene phenotype prediction, our kNNA-based method is superior in both prediction accuracy and program execution time.

Throughout the thesis, our research method first builds feature information based on GO (Gene Ontology) and KEGG (Kyoto Encyclopedia of Genes and Genomes) enrichment scores. It then analyzes the characteristics of genes and proteins through feature selection, including the maximum relevance minimum redundancy and incremental feature selection methods. Finally, machine learning methods are trained on the training set, and the leave-one-out test method is used to predict the results of the test set. The study shows that the proposed kNNA-based algorithm has marked advantages in dealing with such multi-label problems, and that its generalization ability is strong enough for it to be applied to other multi-label problems.
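
The following is a minimal sketch of the general idea of ranking labels with a nearest-neighbor vote and scoring it by leave-one-out testing. It is not the thesis's kNNA algorithm, whose details the abstract does not give. The function names, the distance-weighted voting rule, the toy data, and the reading of "first prediction accuracy" as "the top-ranked label is one of the true labels" are all assumptions made for illustration.

```python
from collections import defaultdict
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_labels(query, train_X, train_Y, k=3):
    """Rank labels for `query` by distance-weighted votes of its k nearest neighbors.

    train_Y holds one set of labels per training instance, so each neighbor
    can vote for several labels at once (no decomposition into single-label
    problems). The weighting 1 / (1 + distance) is an illustrative choice.
    """
    neighbors = sorted(
        zip(train_X, train_Y), key=lambda pair: euclidean(query, pair[0])
    )[:k]
    scores = defaultdict(float)
    for features, labels in neighbors:
        weight = 1.0 / (1.0 + euclidean(query, features))
        for label in labels:
            scores[label] += weight
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda label: (-scores[label], label))


def leave_one_out_first_accuracy(X, Y, k=3):
    """Fraction of instances whose top-ranked label is among their true labels.

    Each instance is held out in turn and predicted from all the others.
    """
    hits = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_Y = Y[:i] + Y[i + 1:]
        ranking = rank_labels(X[i], train_X, train_Y, k)
        if ranking and ranking[0] in Y[i]:
            hits += 1
    return hits / len(X)


# Hypothetical toy data: two feature dimensions, labels are placeholders.
toy_X = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [1.0, 0.9], [0.9, 1.0], [1.1, 1.0]]
toy_Y = [{"A", "B"}, {"A"}, {"A", "B"}, {"C"}, {"C", "D"}, {"C"}]

print(rank_labels([0.05, 0.05], toy_X, toy_Y, k=3))
print(leave_one_out_first_accuracy(toy_X, toy_Y, k=2))
```

In a real setting the feature vectors would be the GO and KEGG enrichment scores that remain after feature selection, and the label sets would be the phenotypes or subcellular locations annotated for each gene.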
Keywords/Search Tags: gene phenotype, subcellular location, multi-label, k nearest neighbor algorithm, incremental feature selection