Exploration Of Multi-label Classification In Bioinformatics

Posted on:2015-11-28

Degree:Master

Type:Thesis

Country:China

Candidate:T Zhang

GTID:2180330422489346

Subject:Systems analysis and integration

Abstract/Summary:

Observing the phenotypes caused by a gene's overexpression, or by its subcellular location, is a basic method of investigating gene function. Many advanced biotechnologies, such as RNAi, have been developed to study gene phenotypes, but they still have many limitations. Besides the time and cost involved, the knockdown of some genes may be lethal, which makes the observation of other phenotypes impossible. Computational methods that help study these questions are therefore essential.

These questions are multi-label classification problems. Traditional approaches to multi-label problems, such as BR (Binary Relevance) and RPC (Ranking by Pairwise Comparison), are often based on data decomposition: they transform a multi-label problem into multiple single-label problems. This approach has some value. However, a newer kind of algorithm, which treats the instances and their labels as a whole network, has been shown to perform better in both prediction accuracy and time cost. Based on this idea, this thesis proposes an improved kNN (k-nearest neighbor) algorithm and applies it to predicting yeast gene phenotypes and to predicting subcellular localization, achieving good performance: the first prediction accuracy reached 62.38% and 66%. Compared with three RPC-based algorithms (SMO, RandomForest, and Bagging) on gene phenotype prediction, our kNNA-based method is clearly superior in prediction accuracy and program execution time.

Throughout the thesis, our method first builds feature information from GO (Gene Ontology) and KEGG (Kyoto Encyclopedia of Genes and Genomes) enrichment scores. It then analyzes the characteristics of genes and proteins through feature selection, using the maximum relevance and minimum redundancy method and the incremental feature selection method. Finally, machine learning methods are trained on the training set, and the leave-one-out test is used to predict the results on the test set. The study shows that the proposed improved algorithm based on kNNA has clear advantages in handling such multi-label problems, and that it generalizes well enough to be applied to other multi-label problems.
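As a rough illustration of the general idea of neighbor-based multi-label prediction with leave-one-out testing, the sketch below ranks labels by votes from the k nearest training instances and counts a hit when the top-ranked label is one of the instance's true labels. It is not the kNNA algorithm proposed in the thesis, whose details are not given in this abstract. The function names, the Euclidean distance, the tie-breaking rule, and the toy data are all assumptions made for the example.

```python
from collections import Counter
from math import dist


def knn_rank_labels(train_X, train_Y, query, k=3):
    """Rank labels by how many of the k nearest training instances carry them.

    Hypothetical helper: plain neighbor voting, not the thesis's kNNA.
    """
    # Order training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], query))
    votes = Counter()
    for i in order[:k]:
        votes.update(train_Y[i])  # each neighbor votes for all of its labels
    # Most-voted labels first; ties are broken alphabetically (an assumption).
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [label for label, _ in ranked]


def leave_one_out_first_accuracy(X, Y, k=3):
    """Share of instances whose top-ranked label is one of their true labels."""
    hits = 0
    for i in range(len(X)):
        # Hold out instance i and predict it from all remaining instances.
        rest_X = X[:i] + X[i + 1:]
        rest_Y = Y[:i] + Y[i + 1:]
        ranked = knn_rank_labels(rest_X, rest_Y, X[i], k)
        if ranked and ranked[0] in Y[i]:
            hits += 1
    return hits / len(X)


# Toy feature vectors and label sets, invented purely for illustration.
X = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1), (1.0, 1.1), (1.1, 1.0), (0.9, 1.0)]
Y = [
    {"nucleus"},
    {"nucleus", "cytoplasm"},
    {"nucleus"},
    {"membrane"},
    {"membrane", "cytoplasm"},
    {"membrane"},
]

accuracy = leave_one_out_first_accuracy(X, Y, k=2)
print(f"leave-one-out first-prediction accuracy: {accuracy:.2f}")
```

In a real study the feature vectors would come from the selected GO and KEGG enrichment scores rather than from hand-written tuples, and the choice of k and of the distance measure would need to be tuned.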

Keywords/Search Tags:

gene phenotype, subcellular location, multi-label, k-nearest neighbor algorithm, incremental feature selection

Related items

1	Predicting Subnuclear Location Of Proteins And Subcellular Location Of Ncrnas Based On Multi-Information Fusion And Multi-Label Ensemble Classifier
2	A Multi-label Classifier Based On PSSM And GO For Predicting Protein Subcellular Localization
3	Using Multi-label Learning Methods To Study Protein Subcellular Localization Prediction
4	Research On Protein Subcellular Localization Prediction Under Multi-label Setting
5	Research On Protein Subcellular Location Classification Based On Feature Learning
6	Pattern Analysis And Recognition Of Image-based Protein Subcellular Location
7	Research On Multi-site Protein Subcellular Localization Prediction Method Based On Fusion Feature And Multi-label Deep Forest Model
8	Research Of Multi-label Feature Selection Algorithms In The Form Of Nonlinear Programming
9	Predicting Multi-label Protein Subcellular Location Based On Deep Learning
10	A Multi-feature Fusion Algorithm For LncRNA Subcellular Localization Prediction Problem