Research On The Grey Discrete Model Of Protein Sequence And Its Application In Drug Design

Posted on: 2014-10-31
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W Z Lin
GTID: 1261330425970498
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
With the implementation and development of the Human Genome Project, life science research has entered the postgenomic era, in which the number of proteins with known sequences has increased explosively. However, the majority of these sequence-known proteins have unknown functions and structures. Determining their structures and functions by experiment alone is a huge undertaking that is both time-consuming and costly. It is therefore urgent to design computational methods that automatically predict various attributes of proteins. Such predictors can guide the design of experiments that determine protein structures and functions, and can screen for proteins with special functions. Although some models and tools have been developed, the avalanche of protein sequences generated in the postgenomic age makes it highly desirable to develop new models, theories, methods, technologies and tools.
In our view, a protein is an uncertain system in which the amino acid sequence is known whereas the overall relationship between those amino acids is unknown. From this viewpoint, we proposed two models that express the features of protein sequences: grey pseudo amino acid components (Grey-PseAAC) and the grey position specific scoring matrix (Grey-PSSM). We also studied local support vector machines based on grey incidence and proposed the Grey-LSVM model. Subsequently, these models were applied to screen proteins useful for pharmaceuticals.
This work involved identifying DNA-binding proteins, characterizing secreted proteins of the malaria parasite, predicting animal protein subcellular locations, and classifying the functions of antimicrobial peptides. The main contents and innovations of this thesis are summarized as follows:
(1) Researching the grey discrete models of protein sequences
To address the problem of expressing proteins mathematically, I studied how to extract the intrinsic features of proteins and proposed two grey discrete models describing proteins. A grey local SVM (support vector machine) was also developed based on a local learning algorithm. A protein sequence is composed of letter codes for 20 different amino acids. Because most machine learning algorithms only process discrete vector input, expressing the features of a protein sequence as a discrete vector is the primary step in developing automated methods for identifying protein structures and functions. A protein sequence can be viewed as a grey system whose sequence components are known but whose inner relations among those components are unknown. Accordingly, after a protein is transformed into a discrete numerical sequence based on the physicochemical attributes of its amino acids, features can be extracted from it, defined as the parameters of the GM(2,1) model built on that numerical sequence. Incorporating these features with the amino acid composition of the protein yields the Grey-PseAAC model. Likewise, the Grey-PSSM model is constructed from the PSSM (position specific scoring matrix) by using grey models. These models can mine the intrinsic sequence features of proteins, expose their family characteristics and evolutionary information, and better reflect the relationship between protein sequences and their structures or functions.
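The feature extraction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: it assumes the common textbook form of GM(2,1), x0[k] - x0[k-1] + a1*x0[k] + a2*z1[k] = b, where z1 is the mean of consecutive cumulative sums, and it assumes the Kyte-Doolittle hydropathy scale as the physicochemical encoding (the abstract does not name a specific attribute). The function names `gm21_parameters` and `grey_pseaac` are hypothetical, and any weighting or normalization of the grey parameters is omitted.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values. Choosing this scale is an assumption;
# the abstract only refers to "physical and chemical attributes".
HYDROPATHY = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8,
    "G": -0.4, "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8,
    "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5, "R": -4.5,
    "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}


def solve_linear(a, b):
    """Solve a small dense system a*x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: sequence too short or degenerate")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def gm21_parameters(x0):
    """Least-squares estimate of [a1, a2, b] in the assumed GM(2,1) form."""
    x1, total = [], 0.0
    for value in x0:  # first-order accumulated sequence (1-AGO)
        total += value
        x1.append(total)
    rows, ys = [], []
    for k in range(1, len(x0)):
        z = 0.5 * (x1[k] + x1[k - 1])   # background value z1[k]
        rows.append([-x0[k], -z, 1.0])  # row of the design matrix B
        ys.append(x0[k] - x0[k - 1])    # first-order difference of x0
    # Normal equations (B^T B) p = B^T Y for the three parameters.
    btb = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    bty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    return solve_linear(btb, bty)


def grey_pseaac(sequence, scale=HYDROPATHY):
    """20 amino acid frequencies followed by the three GM(2,1) parameters."""
    residues = [r for r in sequence.upper() if r in scale]  # skip non-standard codes
    counts = Counter(residues)
    composition = [counts[a] / len(residues) for a in AMINO_ACIDS]
    return composition + gm21_parameters([scale[r] for r in residues])
```

Calling `grey_pseaac` on a protein sequence returns a 23-dimensional vector in this sketch. Repeating the grey step for several physicochemical scales would append three more parameters per scale.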
In addition, a novel model, the local SVM with grey relational degree, is proposed.
(2) Studying intelligent identification of DNA-binding proteins
To address the problems in automatically identifying DNA-binding proteins (DBPs), I researched how to construct effective training sets and how to combine prediction models with protein representation models, and proposed a novel model for identifying DBPs. DBPs play crucial roles in various cellular processes. Identifying them quickly and effectively is significant for genome annotation. Although many machine learning methods have been applied to the identification of DNA-binding proteins, these predictors still need to be improved. Therefore, a novel prediction tool for identifying DNA-binding proteins, called iDNA-Prot, is proposed, which combines the Grey-PseAAC model with the random forest algorithm. On a newly constructed restricted dataset, iDNA-Prot has both a shorter computing time and higher prediction rates than other DBP predictors. The free web server is available at http://www.jci-bioinfo.cn/iDNA-Pro.
(3) Investigating the characterization of secreted proteins of the malaria parasite
To address the problem of identifying drug targets in antimalarial drug design, I investigated methods for characterizing secreted proteins of the malaria parasite and presented an original model for identifying them. The culprit causing the disease is the parasite, which secretes an array of proteins within the host erythrocyte to facilitate its own survival. Accordingly, the secreted proteins of the malaria parasite have become a logical target for drug design against malaria. To shorten the time needed to develop new antimalarial drugs, one strategy is to identify newly secreted proteins of the malaria parasite in a timely manner, since they can serve as potential drug or vaccine targets.
Here, an automated predictor called "iSMP-Grey" is developed for identifying the secreted proteins of the malaria parasite from their sequence information alone. It achieves remarkably higher performance than other existing predictors in this area. A web server is freely accessible to the public at http://www.jci-bioinfo.cn/iSMP-Grey.
(4) Predicting multi-label subcellular localization of proteins
After studying multi-label classification models and the performance metrics for multi-label prediction of subcellular localization, I constructed a novel model for predicting the multi-label subcellular localization of proteins. Proteins may simultaneously exist at, or move between, two or more subcellular location sites, which makes protein subcellular localization a challenging problem. Subcellular mislocalization of proteins can cause various disorders, so subcellular location is an important research subject in drug design. By incorporating an improved GO (Gene Ontology) formulation and the Grey-PSSM model, a novel model describing overall protein features is presented. Subsequently, a subcellular location predictor called iLoc-Animal was developed using a multi-label learning approach. iLoc-Animal performs better in the jackknife test. The iLoc-Animal web server is freely accessible to the public at http://www.jci-bioinfo.cn/iLoc-Animal.
(5) Classifying AMP functional families
To address the classification of antimicrobial peptides (AMPs) on an imbalanced, multi-label dataset, I studied how to resample the data and integrate classifiers, and then developed a novel AMP predictor for imbalanced multi-label datasets. AMPs are an evolutionarily conserved component of the innate immune response and are found among all classes of life. According to their special functions, AMPs are generally classified into ten categories. Given a query peptide, how can we identify whether it is an AMP or a non-AMP?
If it is, can we identify which functional type or types it belongs to? Addressing these problems is clearly important for developing immune drugs. In particular, the numbers of annotated AMP examples differ greatly among the functional classes, and an AMP may belong to two or more functional types. A multi-label classifier that handles imbalanced datasets is developed based on resampling and the multi-label K-nearest neighbor algorithm, with the protein feature vector defined by the Grey-PseAAC model. The prediction results exceed those of existing methods in this area.
Finally, the problems and prospects of imbalanced multi-label classification are discussed.
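The combination of resampling and multi-label K-nearest neighbor classification described in section (5) can be sketched as follows. This is an illustration under assumptions, not the thesis method: the abstract does not specify the resampling scheme or the K-nearest neighbor variant, so the sketch assumes simple cyclic oversampling of rare labels, the ML-kNN formulation of Zhang and Zhou with smoothing parameter s, and Euclidean distance. The function names `oversample`, `fit_mlknn` and `predict_mlknn` are hypothetical. Feature vectors stand in for Grey-PseAAC vectors, and label sets are 0/1 lists.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def oversample(xs, ys):
    """Duplicate samples of rare labels until each label reaches the largest label count."""
    xs, ys = list(xs), list(ys)
    n_labels = len(ys[0])
    target = max(sum(y[l] for y in ys) for l in range(n_labels))
    for l in range(n_labels):
        members = [i for i, y in enumerate(ys) if y[l]]
        count, pos = len(members), 0
        while members and count < target:
            i = members[pos % len(members)]  # cycle through the label's samples
            xs.append(xs[i])
            ys.append(ys[i])
            count += 1
            pos += 1
    return xs, ys


def nearest(train_x, query, k, skip=None):
    """Indices of the k training points closest to query, optionally skipping one index."""
    candidates = [i for i in range(len(train_x)) if i != skip]
    candidates.sort(key=lambda i: euclidean(train_x[i], query))
    return candidates[:k]


def fit_mlknn(train_x, train_y, k=3, s=1.0):
    """Per label: prior P(label) and likelihoods of seeing j labelled neighbours."""
    m, n_labels = len(train_x), len(train_y[0])
    neighbours = [nearest(train_x, train_x[i], k, skip=i) for i in range(m)]
    model = []
    for l in range(n_labels):
        prior1 = (s + sum(y[l] for y in train_y)) / (2 * s + m)
        with_label = [0] * (k + 1)     # counts for samples that carry label l
        without_label = [0] * (k + 1)  # counts for samples that do not
        for i in range(m):
            j = sum(train_y[n][l] for n in neighbours[i])
            if train_y[i][l]:
                with_label[j] += 1
            else:
                without_label[j] += 1
        like1 = [(s + c) / (s * (k + 1) + sum(with_label)) for c in with_label]
        like0 = [(s + c) / (s * (k + 1) + sum(without_label)) for c in without_label]
        model.append((prior1, like1, like0))
    return model


def predict_mlknn(model, train_x, train_y, query, k=3):
    """Assign each label by comparing the two posterior terms (MAP rule)."""
    idx = nearest(train_x, query, k)
    labels = []
    for l, (prior1, like1, like0) in enumerate(model):
        j = sum(train_y[n][l] for n in idx)
        labels.append(1 if prior1 * like1[j] > (1 - prior1) * like0[j] else 0)
    return labels
```

Duplicating a multi-label sample raises the count of every label it carries, so this naive oversampling can overshoot on co-occurring labels. The same k must be passed to `fit_mlknn` and `predict_mlknn`.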
Keywords/Search Tags: Bioinformatics, Grey-PseAAC (Grey Pseudo Amino Acid Components), Grey-PSSM (Grey Position Specific Scoring Matrix), imbalanced multi-label classification, DNA-binding protein, secretory proteins of malaria parasite, subcellular localization