Research And Implementation Of Protein Subcellular Localization Prediction System Based On Ensemble Multi-label Learning

Posted on: 2018-07-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S P Qia
GTID: 1310330542950820
Subject: Information management and electronic commerce
Abstract/Summary:
In the post-genome era, with the development and application of high-throughput sequencing technology, the number of protein sequences without location information is growing fast, and the number of proteins located at multiple sites is increasing rapidly. It is becoming more and more impractical to solve the problem of protein subcellular localization using traditional experimental approaches alone. Against this background, protein subcellular localization prediction methods based on machine learning have gradually developed; their prediction function is basically realized through classification learning technologies. Because the subcellular location of a protein provides valuable clues about its function, research on such prediction methods helps to speed up the discovery of the sequence-structure-function relationships of proteins and supports the annotation and management of protein data.

Among the proposed methods of protein subcellular localization prediction, only a few are based on ensemble multi-label learning, and most focus only on specific problems, algorithms or technologies. The exploration of general-purpose and extensible prediction systems, especially those based on ensemble multi-label learning, is still lacking. Studying the underlying infrastructure of the ensemble multi-label learning model, so as to better support the implementation of protein subcellular localization prediction systems and to provide a reference model for building systems that solve other multi-label problems, is therefore of both theoretical and practical significance. To promote the application of ensemble multi-label learning technologies to protein subcellular localization prediction, this dissertation studies the design of the underlying interfaces of the ensemble multi-label learning model, the building of an ensemble multi-label learning framework, and the implementation of a Java-based protein subcellular localization prediction system. The implemented system is applied to predicting protein subcellular locations, and both its functional operations and its prediction performance are analyzed in detail. The main research work is as follows.

Firstly, based on a review and categorization of machine learning algorithms, a new kind of generalized multi-label learning algorithm, named the "labelset-oriented learning algorithm", is formulated. On this basis, by combining binary classification learning, labelset-oriented learning, ensemble learning, optimization learning and object-oriented technology, a universal and extensible three-layer ensemble multi-label learning model is proposed. By designing the class hierarchy of this model, a Java-based ensemble multi-label learning foundation class library named EMLL.jar has been released. This library lays the foundation for extending systems to solve multi-label learning problems.

Secondly, with the support of EMLL.jar, example representation and performance metrics, a framework for an ensemble multi-label learning system is built and formally represented by implementing learning algorithms in the three layers. The detailed building flow and execution procedure of this learning system are also described. In the individual learning layer, a number of existing binary classification, multi-class classification and multi-label learning algorithms are implemented, either directly or through some transformation. In addition, several novel labelset-oriented learning algorithms are designed using Error-Correcting Output Coding (ECOC), one-to-one, one-to-many and many-to-many strategies respectively. These algorithms all support the diversity of individual classifiers. In the ensemble learning layer, several ensemble modes and ensemble strategies are designed and implemented. An ensemble mode determines which combination of individual classifiers is used in ensemble operations, and an ensemble strategy describes how the ensemble is formed from these individual classifiers. Together, these two elements support the generation of ensemble classifiers. In the optimization learning layer, a weighted classifier optimization method based on prediction confidence and the Particle Swarm Optimization (PSO) algorithm is designed. Applying this optimization method improves the prediction performance. The whole learning procedure is controlled by flow-based user interfaces and multi-threading technology. A configuration with different properties allows system functions and performance requirements to be extended flexibly and dynamically. A Java-based library of this framework, EMLLS.jar, has also been released, which makes secondary development feasible.

Thirdly, for the problem of protein subcellular localization prediction, a prediction system based on ensemble multi-label learning is extended from the proposed three-layer learning architecture and the EMLLS.jar library, building on the implementation and improvement of several protein feature representations. In this system, protein sequences can be stored in a simply formatted dataset and accessed in a lightweight way. Some tools for handling protein features are also provided. Moreover, by counting the occurrence frequencies of the different combinations of protein subcellular locations contained in a dataset, a subcellular location correlation model, which reflects information about both relevant and irrelevant labels, is explored and constructed, and a filtering optimization approach is then proposed based on this model. A test dataset is used to exercise the functions of the prediction system. The results show that the whole procedure runs normally: from reading the configuration, setting the learning style and loading the dataset, through the running of the individual learning, ensemble learning and optimization learning tasks and the generation and serialization of the final classifier during the on-line learning stage, to the functional operations in the off-line prediction stage. This verifies that the prediction system is workable.

Finally, to measure the performance of the prediction system, two experiments are conducted, one on a gram-positive bacteria protein dataset and one on an animal protein dataset. In the experiments, feature ensemble is compared with feature fusion; the individual learning, ensemble learning and optimization methods are all tested; the performance on different subcellular locations and on labelsets of different sizes is analyzed; the influence of the different learning criteria used to guide the learning process is compared; and the effect of the proposed subcellular location correlation model is discussed. A combined analysis of the results and a comparison with some related studies show that the prediction system achieves better prediction performance.

In conclusion, this dissertation designs and implements an extensible three-layer ensemble multi-label learning system model, and extends from it an actual prediction system for the problem of protein subcellular localization. Based on the experimental results, both the operability of the system's functions and the effectiveness of its performance are verified. The performance of the system can be improved further by adding novel features, algorithms and so on, which makes it a good experimental platform for studying protein subcellular localization prediction. However, this dissertation still has some weaknesses. It lacks a comprehensive exploration of protein feature representations, and the form of expression and the effect of the proposed subcellular location correlation model deserve further improvement. The main directions for future research are to introduce more informative feature representations, to design more effective learning algorithms, to further improve the system and to provide web-based applications.
Keywords/Search Tags:Protein subcellular localization prediction, Ensemble multi-label learning, Optimization learning, Three-layer prediction system
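
As an illustration of the ensemble learning layer described in the abstract, the following is a minimal Java sketch of one possible ensemble strategy: a weighted average of the per-label prediction confidences of several individual classifiers, followed by thresholding into a labelset. It is not taken from EMLL.jar or EMLLS.jar. The class name WeightedLabelEnsemble, the methods combine and predictLabels, the 0.5 threshold and the example numbers are all hypothetical. The weights are fixed by hand here, whereas the dissertation optimizes classifier weights with PSO.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a weighted multi-label ensemble (not the EMLL/EMLLS API). */
public class WeightedLabelEnsemble {

    /**
     * Combines per-classifier confidence vectors into one weighted average.
     * confidences[c][l] is the confidence of classifier c for label (location) l.
     */
    public static double[] combine(double[][] confidences, double[] weights) {
        if (confidences.length == 0 || confidences.length != weights.length) {
            throw new IllegalArgumentException("need one weight per classifier");
        }
        int numLabels = confidences[0].length;
        double weightSum = 0.0;
        for (double w : weights) {
            weightSum += w;
        }
        if (weightSum <= 0.0) {
            throw new IllegalArgumentException("weights must sum to a positive value");
        }
        double[] combined = new double[numLabels];
        for (int c = 0; c < confidences.length; c++) {
            if (confidences[c].length != numLabels) {
                throw new IllegalArgumentException("all classifiers must score the same labels");
            }
            for (int l = 0; l < numLabels; l++) {
                combined[l] += weights[c] * confidences[c][l];
            }
        }
        // Normalize so the result stays in the same range as the input confidences.
        for (int l = 0; l < numLabels; l++) {
            combined[l] /= weightSum;
        }
        return combined;
    }

    /**
     * Returns the indices of all labels whose combined confidence reaches the threshold.
     * Assumption: every protein has at least one location, so if no label passes,
     * the single highest-scoring label is returned instead.
     */
    public static List<Integer> predictLabels(double[] combined, double threshold) {
        List<Integer> labels = new ArrayList<>();
        if (combined.length == 0) {
            return labels;
        }
        int best = 0;
        for (int l = 0; l < combined.length; l++) {
            if (combined[l] >= threshold) {
                labels.add(l);
            }
            if (combined[l] > combined[best]) {
                best = l;
            }
        }
        if (labels.isEmpty()) {
            labels.add(best);
        }
        return labels;
    }

    public static void main(String[] args) {
        // Three hypothetical individual classifiers scoring three subcellular locations.
        double[][] confidences = {
            {0.9, 0.2, 0.6},
            {0.7, 0.1, 0.3},
            {0.8, 0.4, 0.7}
        };
        double[] weights = {0.5, 0.2, 0.3}; // hand-picked; PSO would search for these
        double[] combined = combine(confidences, weights);
        System.out.println(predictLabels(combined, 0.5));
    }
}
```

In this sketch an optimizer such as PSO would treat the weights array as a particle position and score it with a multi-label performance metric on validation data. That step is omitted here because the abstract does not specify the metric or the PSO settings.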