
Research On Machine Learning Based Protein Class And Protein-ligand Interaction Prediction

Posted on: 2018-11-19
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L N Zha
GTID: 1310330512984935
Subject: Control theory and control engineering
Abstract/Summary:
With the rapid development of bioinformatics in the post-genomic era, the focus of life science research has shifted from decoding genome sequences to annotating gene functions. According to the central dogma of molecular biology, genes that carry genetic information can perform physiological functions in organisms only after they are translated into proteins. As high-throughput sequencing technologies mature, the number of proteins with known sequences is growing exponentially. By contrast, the number of proteins with known functions is growing far more slowly, so the gap between sequence-known and function-known proteins keeps widening.

Protein function prediction has therefore become an important and challenging research subject. It contributes to exploring not only the origin of life and genetic variation, but also the pathogenesis of major diseases at the cellular and molecular levels, thereby providing important theoretical support for disease diagnosis, prevention, and drug development. Traditional experimental techniques are expensive, time-consuming, and cannot be carried out on a large scale. It is therefore urgent to develop reliable, economical, and high-throughput computational methods for predicting protein functions rapidly and effectively. Protein class prediction and protein-ligand interaction prediction are two important branches of protein function prediction. Based on machine learning, this dissertation makes an in-depth study of both. The detailed contents are summarized as follows.

(1) Binary classification problems in protein class prediction

Bacteriophage virion proteins and non-virion proteins have distinct biological functions. Accurate identification of bacteriophage virion proteins from bacteriophage protein sequences contributes to understanding the complex virulence mechanism and developing antibacterial drugs. Existing methods lacked comprehensive sequence descriptors and built their prediction models on individual classifiers. A new bacteriophage virion protein prediction method based on stacking is proposed. This method extracts information on amino acid composition, position, order, distribution, physicochemical properties, and evolution from protein sequences. A separate random forest prediction model is constructed for each feature extraction strategy, and the outputs of these random forest models are integrated by a logistic regression algorithm. When evaluated on the independent testing dataset, the proposed method performs better than the methods of previous studies. This method is therefore an effective tool for predicting bacteriophage virion proteins.

Antioxidant proteins play significant roles in maintaining the oxidation/antioxidation balance and have therapeutic potential for some diseases. Accurate identification of antioxidant proteins can provide a theoretical basis for revealing the physiological mechanism of the oxidation/antioxidation balance and for developing antioxidant drugs. In view of the shortcomings of existing methods, an ensemble learning method for antioxidant protein prediction is proposed based on multiple feature extraction strategies and a classifier selection strategy. To further improve the prediction performance, Relief combined with the IFS (Incremental Feature Selection) method is adopted to eliminate redundant and irrelevant features. On the independent testing dataset, the proposed method achieves a more balanced sensitivity and specificity, significantly better than those of the existing methods.

Anti-angiogenic peptides can inhibit the angiogenesis process and contribute to therapies for angiogenesis-related diseases. Accurate identification of anti-angiogenic peptides can provide significant clues for understanding the angiogenesis mechanism and developing antineoplastic therapies. The existing method was developed on an individual classifier and did not employ feature selection to obtain discriminative features. An ensemble classifier for anti-angiogenic peptide prediction is constructed by selecting one classifier with a high sensitivity and another classifier with a high specificity. To reduce the computational complexity and improve the prediction quality, the Relief-IFS method is adopted to search for the features most relevant to the target. A comparison between the ensemble classifier and the existing method on the same benchmark dataset shows that the proposed ensemble classifier is effective in predicting anti-angiogenic peptides.

(2) Multi-classification problems in protein class prediction

Different types of J-proteins perform distinct functions in disease development. Accurate identification of J-protein types will provide significant clues to the functions of J-proteins in biological processes and contribute to understanding the pathogenesis of diseases. The existing method lacked comprehensive sequence descriptors and did not deal with the class imbalance problem. Drawing on ensemble learning, a prediction model for J-protein types is constructed based on the under-sampling method. This prediction model effectively deals with the class imbalance problem. Compared with the existing method, this ensemble classifier obtains a more balanced sensitivity and specificity.

Conotoxins targeting different ion channels have distinct physiological functions and therapeutic potential. Accurate identification of the types of ion channel-targeted conotoxins will contribute to deciphering the physiological mechanisms and pharmacological properties of conotoxins. Previous studies merely extracted composition-based features from protein sequences and did not deal with the class imbalance problem. By extracting information on amino acid composition, distribution, order, physicochemical properties, and secondary structure, a new prediction model for the types of ion channel-targeted conotoxins is proposed. This model employs SMOTE (Synthetic Minority Oversampling Technique) to increase the number of minority-class samples. On the independent testing dataset, the prediction accuracies of the proposed model for the different types of ion channel-targeted conotoxins are all better than those of the existing methods, which verifies its prediction ability.

(3) Protein-ligand interaction prediction

Protein-aptamer interactions serve a variety of physiological functions in organisms and have therapeutic potential. Predicting protein-aptamer interactions rapidly and effectively is important for understanding their mechanisms and for developing aptamer-based therapies. The previous study was based on an individual classifier, merely extracted composition-based features from sequences, and did not deal with the class imbalance problem. A new ensemble method is presented to predict protein-aptamer interactions by combining multiple feature extraction strategies. The proposed method achieves a more balanced sensitivity and specificity on the training dataset under 10-fold cross-validation, which indicates that it can handle the class imbalance problem effectively. To evaluate the prediction quality objectively, the proposed method and the existing method are tested and compared on an independent testing dataset. Encouragingly, the proposed method achieves a better sensitivity and Youden's index than the previous study.
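
The abstract names amino acid composition as one of its sequence descriptors and reports results as sensitivity, specificity, and Youden's index. The sketch below shows one common way such quantities might be computed. It assumes the 20 standard residues and the usual definition of Youden's index (sensitivity + specificity - 1). The function names `amino_acid_composition` and `binary_metrics` are hypothetical, and nothing here is taken from the dissertation's own implementation, whose exact descriptors are not given in the abstract.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so feature vectors align.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return a 20-value list: the fraction of each standard residue."""
    counts = Counter(sequence.upper())
    # Normalise by the number of standard residues only, so that
    # non-standard symbols (e.g. 'X') do not distort the fractions.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def binary_metrics(y_true, y_pred):
    """Compute sensitivity, specificity and Youden's index for 0/1 labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)  # fraction of positives recovered
    specificity = tn / (tn + fp)  # fraction of negatives recovered
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "youden": sensitivity + specificity - 1,
    }


if __name__ == "__main__":
    features = amino_acid_composition("MKTAYIAKQR")  # toy sequence
    print(len(features), round(sum(features), 6))
    print(binary_metrics([1, 1, 0, 0], [1, 0, 0, 0]))
```

Youden's index rewards a classifier only when sensitivity and specificity are both high, which is why it is a natural companion to the "balanced sensitivity and specificity" criterion used for the imbalanced datasets described above.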
Keywords/Search Tags:Protein class prediction, Protein-ligand interaction prediction, Machine learning, Feature extraction, Class imbalance problem