Study On Multi-label Prediction For Several Types Of Protein Classification

Posted on: 2015-10-30
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C Huang
Full Text: PDF
GTID: 1228330422488721
Subject: Control theory and control engineering
Abstract/Summary:
The study of proteins is not only an important part of traditional biology but also the core of bioinformatics research. Because protein types are diverse and their roles are complex, functional studies of proteins are difficult but meaningful. Based on a systematic review of previous research, this dissertation extends and deepens the study of several important protein classification problems.

Prediction of protein subcellular location is a meaningful task that has attracted much attention in recent years, and Chapter 2 focuses on it. Most protein subcellular location predictors in the literature can only deal with single-location proteins. However, some proteins belong to two or more subcellular locations. It is important to develop predictors that can handle such multiplex proteins, because they have special biological significance, especially in basic biological research and drug discovery. Since few methods deal with multiplex proteins, it is worthwhile to explore new methods that can predict the subcellular location of proteins with both single and multiple sites. Applying different feature extraction methods and different prediction algorithms to different benchmark datasets may yield some general conclusions. In this chapter, two feature extraction methods and two neural network models were applied to three benchmark datasets covering different kinds of proteins, i.e., datasets constructed specifically for Gram-positive bacterial proteins, plant proteins, and virus proteins. These benchmark datasets have different numbers of location sites, and their scales are small. The results show that the RBF neural network is clearly superior to the BP neural network on these datasets, whichever feature extraction method is chosen.

This chapter also addresses prediction on large-scale multi-label datasets. Here a multi-label KNN algorithm was adopted in place of the BP neural network and used alongside the RBF neural network. These two prediction algorithms and two feature extraction methods were combined into multiple predictive models, which were applied to three different benchmark datasets: human proteins, eukaryotic proteins, and Gram-negative bacterial proteins. These benchmark datasets are larger and have different numbers of location sites. The results show that, in general, the multi-label KNN algorithm and the RBF algorithm achieved similar prediction performance, and some combinations have their own advantages on particular datasets. Overall, the PSSM+RBF combination showed the best prediction performance.

Predicting membrane protein type is another interesting task, because this information is very useful for explaining the function of a membrane protein. With the explosion of newly discovered protein sequences, efficient computational tools for quickly and accurately predicting the membrane type of a given protein sequence are highly desirable. Although several membrane protein predictors have been developed, they can only deal with membrane proteins that belong to a single membrane type. In fact, some membrane proteins belong to two or more types. To solve this problem, Chapter 3 proposes an approach for predicting membrane protein sequences with single or multiple types, and encouraging prediction results were obtained.

For a given protein, it is very meaningful to know which quaternary structure type it belongs to, because this knowledge is highly correlated with its function. A series of prediction methods for protein quaternary structural type have been proposed, but they focus only on proteins with a single quaternary attribute. In fact, a number of protein sequences are annotated with more than one quaternary attribute, which indicates that developing a computational tool that can predict proteins with both single and multiple quaternary attributes is a meaningful task. In Chapter 4, a new multi-label computational system is established by combining pseudo amino acid composition with the ET-KNN algorithm. The results indicate that it is a powerful tool for an initial study.

Prediction of protein sub-subcellular localization is a further refinement of the study of protein function. Building on the prediction of protein subcellular localization, it focuses on the further subdivision of organelles. Studies in this area concentrate mainly on mitochondria, chloroplasts, and the nucleus. Although several methods have been proposed to predict sub-organelle location for each type of organelle dataset, they can only deal with proteins that exist in a single sub-organelle location. However, according to this study, there are proteins belonging to more than one functional location in each organelle. Unfortunately, this phenomenon has not been given sufficient attention, and no effort has yet been made on this topic. Studying it is important for further understanding the function of proteins. Therefore, it is meaningful and challenging to deal with proteins with multiple location sites instead of simply excluding them. To solve this problem, several datasets with different levels of homology were established for each organelle, and several multi-label prediction models based on different feature extraction methods were selected according to the scale and complexity of these datasets. This is the first effort to predict proteins with multiple sub-subcellular locations, and a series of valuable results were obtained. Taking chloroplast as an example, the overall jackknife success rates achieved by the best combination (features + classifier) on three datasets with different levels of homology were 89.08%, 81.29%, and 71.11%. These results show that the models are efficient methods for predicting the sub-subcellular localization of proteins with both single and multiple locations, and that they might serve as useful auxiliary tools for this task.

Most recent studies apply a particular model or algorithm to only one specific dataset. Considering the avalanche of biological data generated in the post-genomic age, the limitations of this approach become clear when the same model is applied to different datasets. In Chapter 6, a multifunctional ensemble classifier combining several individual classifiers is proposed. Each classifier was trained with a different parameter system extracted from a training system. The final outputs were combined through a weighted voting system. The multifunctional ensemble classifier was evaluated on several strictly constructed biological datasets. Based on the test results from three different types of biological datasets, this new predictor can deal with a broader range of biological data and tentatively achieves more efficient and robust results than other published methods.
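
The weighted voting step described for the ensemble classifier in Chapter 6 can be illustrated with a minimal sketch. The abstract does not say how the individual classifiers score labels, how the weights are chosen, or how the final label set is cut off, so everything below is an assumption made for illustration: the function name `weighted_vote`, the per-label scores in [0, 1], the fixed `threshold`, the fallback to the single best label, and the chloroplast sub-location names used as sample data.

```python
from typing import Dict, List, Sequence


def weighted_vote(
    classifier_scores: Sequence[Dict[str, float]],
    weights: Sequence[float],
    threshold: float = 0.5,
) -> List[str]:
    """Combine per-label scores from several classifiers by weighted voting.

    Hypothetical helper, not the dissertation's implementation.
    Each classifier contributes a dict mapping label -> score in [0, 1].
    """
    if len(classifier_scores) != len(weights):
        raise ValueError("need exactly one weight per classifier")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    # Accumulate the weighted score of every label over all classifiers.
    weighted_sum: Dict[str, float] = {}
    for scores, weight in zip(classifier_scores, weights):
        for label, score in scores.items():
            weighted_sum[label] = weighted_sum.get(label, 0.0) + weight * score

    # Normalise so the combined score stays on the same [0, 1] scale.
    combined = {label: s / total_weight for label, s in weighted_sum.items()}

    # Multi-label decision: keep every label whose combined score passes
    # the threshold (assumed rule).
    predicted = sorted(label for label, s in combined.items() if s >= threshold)

    # Assumed fallback: every protein receives at least one label.
    if not predicted and combined:
        predicted = [max(combined, key=combined.get)]
    return predicted


# Illustrative scores from three hypothetical classifiers for one protein.
example_scores = [
    {"thylakoid": 0.9, "stroma": 0.3, "envelope": 0.1},
    {"thylakoid": 0.6, "stroma": 0.9, "envelope": 0.0},
    {"thylakoid": 0.8, "stroma": 0.8, "envelope": 0.3},
]
example_weights = [0.5, 0.3, 0.2]
example_prediction = weighted_vote(example_scores, example_weights)
print(example_prediction)
```

In this sample, two labels pass the threshold, so the protein is assigned to both locations. In practice the weights might be derived from each classifier's validation performance, but the abstract does not state this.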
Keywords/Search Tags: KNN, Feature extraction, Pseudo amino acid composition, Multi-label, Multifunction, Protein subcellular localization, Neural network, Protein quaternary structure, Membrane protein type