
Protein Remote Homology Detection And DNA-binding Protein Identification

Posted on: 2018-03-21  Degree: Master  Type: Thesis
Country: China  Candidate: S Y Wang  Full Text: PDF
GTID: 2310330533469250  Subject: Computer Science and Technology
Abstract/Summary:
Proteins are essential basic materials of the cell, carrying out the duties specified by the information encoded in genes. In the post-genomic era, with the development of protein sequencing techniques, the number of protein sequences is growing explosively. Identifying the structure and function of proteins is therefore of great significance in biology. In this dissertation, we studied protein remote homology detection as a problem of protein structure and DNA-binding protein identification as a problem of protein function. Protein remote homology detection refers to the identification of homologous proteins that are similar in structure but share low sequence identity. Its goal is to assign an unknown protein sequence to a specific protein superfamily. DNA-binding proteins play important roles in living organisms, including gene transcription, recombination, repair, and replication. We studied these two problems by extracting features from protein primary sequences and combining them with machine learning to improve predictive performance. The detailed research contents are as follows.

Protein remote homology detection is one of the fundamental problems in protein structure research. In this dissertation, we proposed the Pseudo Dimer Composition (PDC) method to improve on the original pseudo-amino acid composition, which lacks sequence-order information. First, the original protein sequences are transformed into corresponding pseudo protein sequences by using the frequency profile, so that the evolutionary information in the profile is embedded into the pseudo protein sequences. Fixed-length vectors are then generated from the pseudo protein sequences by using the PDC method. We combined SVM with an ensemble strategy to predict protein superfamily classification; the classifiers were integrated linearly, using the ROC value of each family as its weight. The experimental results on the benchmark showed that our method achieves an AUC of 0.927 and an AUC50 of 0.749, which indicates that it performs better than other methods in this field.

The identification of DNA-binding proteins plays an important role in protein function research. In this dissertation, the frequency profile and the pseudo-amino acid composition are first employed to transform proteins into fixed-length feature vectors, and the feature vectors are then fed into SVM classifiers to construct the predictor. Multiple predictors were combined by using a bagging strategy, which further improved the performance. The experimental results on an independent dataset showed that the accuracy of our predictor is 76.56% and its AUC is 0.8392. In addition, the biological properties of amino acids in the recognition process were analyzed according to the weights the SVM assigned to different features.

To address the information loss of the pseudo-amino acid composition, we proposed a method based on the combination of Kmer and auto cross covariance (ACC). The Kmer method can extract the information of amino acid distance pairs, and the ACC method can extract the physicochemical information of amino acids. By optimizing the combination of parameters, we can further improve the accuracy of DNA-binding protein identification. The experimental results on an independent test showed that the prediction accuracy of our method is 75.16%, which is better than that of other related methods.

Finally, a new method based on the affinity propagation clustering algorithm and reduced alphabets is proposed for DNA-binding protein identification. In this method, 656 basic classifiers were clustered into 10 categories, and the classifier with the highest accuracy in each category was selected. The 10 selected classifiers were then combined by using a linear ensemble strategy. The experimental results on an independent dataset showed that our method achieved an accuracy of 83.87%.
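
The linear ensemble strategy mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each base classifier outputs a real-valued score for a sample and that a weight is available for each classifier (for example, its ROC value on validation data). The function name `weighted_ensemble`, the normalization by the sum of the weights, and the example numbers are hypothetical and are not taken from the thesis.

```python
def weighted_ensemble(scores, weights):
    """Linearly combine per-classifier scores into a single score.

    scores  -- one real-valued score per base classifier for one sample
    weights -- one non-negative weight per base classifier
               (assumed here to be something like a validation ROC value)
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Weighted sum, normalized so the result stays on the same scale
    # as the individual scores (the normalization is an assumption).
    return sum(s * w for s, w in zip(scores, weights)) / total


# Hypothetical scores from three base classifiers for one protein,
# and hypothetical weights for those classifiers.
example_scores = [0.9, 0.4, 0.7]
example_weights = [0.95, 0.60, 0.80]
combined = weighted_ensemble(example_scores, example_weights)
print(f"combined score: {combined:.3f}")
```

With this kind of weighting, classifiers that performed better on validation data contribute more to the final score, while equal weights reduce the combination to a plain average.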
Keywords/Search Tags: protein remote homology, DNA-binding protein, ensemble learning