The Study On DNA-binding Protein Prediction Based On Sequence Information

Posted on: 2015-04-16
Degree: Master
Type: Thesis
Country: China
Candidate: J Y Zhou
GTID: 2310330422491820
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of protein sequencing techniques, the structure and function of proteins are understood in ever greater depth. However, the rapidly growing number of protein sequences poses a major challenge for the automatic prediction of protein structure and function. DNA-binding proteins are a class of proteins that bind to DNA to form a complex, and they are indispensable to every cellular activity. Automatic prediction can identify DNA-binding proteins quickly, which promotes the rapid identification of drug target proteins and the development of computer-aided drug design (CADD).

The prediction of DNA-binding proteins falls into two categories: prediction for proteins of unknown structure and prediction for proteins of known structure. Although methods that use structural information can achieve better predictive performance, the structures of the vast majority of proteins are unknown, so such methods cannot be applied to high-throughput protein function prediction. This thesis focuses on the prediction of DNA-binding proteins of unknown structure, that is, prediction based on sequence information alone, and studies it from two perspectives: protein representation methods and machine learning.

The major work of this study is as follows. First, the application of the Top-n-gram based protein representation method to the prediction of DNA-binding proteins is studied. In this part, the specific procedure for converting frequency profiles of unequal dimension into features of equal dimension is studied first; then the discriminant weight of every feature produced by the Top-n-gram based representation is calculated; finally, the important features are analysed. Second, a protein representation method based on Position-Specific Scoring Matrix Distance Transformation (PSSM-DT) is proposed, and its application to the prediction of DNA-binding proteins is studied. The PSSM-DT based representation not only improves predictive performance but also offers a sound biological explanation. Experimental results show that combining the two protein representation methods further improves the performance of DNA-binding protein prediction. Finally, a DNA-binding protein prediction method that combines the two representation methods with ensemble learning is studied. Experimental results show that this method achieves the best performance. Experimental analysis shows that the two protein representation methods proposed in this thesis are complementary, and that combining them with ensemble learning yields a high-performing DNA-binding protein predictor.
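The abstract does not give formulas for either representation, so the following is only a minimal sketch of how such features are commonly computed, not the thesis's actual implementation. It assumes a frequency profile or PSSM is a list of per-position rows with 20 values in the column order `AMINO_ACIDS`; the function names, the column order, and the `max_distance` parameter are all hypothetical. The first function shows one way a profile of any length can be mapped to a fixed-length vector (counting the top-n most frequent residues per position), and the second shows a distance transformation that averages products of PSSM scores at positions separated by a gap `d`.

```python
from collections import Counter
from itertools import product

# Assumed column order of the profile/PSSM (20 standard residues).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def top_n_gram_features(profile, n=1):
    """Map a frequency profile (L rows x 20 columns) to a vector of length 20**n.

    Each position is replaced by a "word" made of its n most frequent
    residues (most frequent first); the vector holds the relative
    frequency of every possible word, so its length does not depend on L.
    """
    if not profile:
        raise ValueError("profile must contain at least one position")
    words = []
    for row in profile:
        # Indices of the n highest-frequency residues; ties keep column order.
        top = sorted(range(len(AMINO_ACIDS)), key=lambda k: -row[k])[:n]
        words.append("".join(AMINO_ACIDS[k] for k in top))
    counts = Counter(words)
    vocabulary = ["".join(p) for p in product(AMINO_ACIDS, repeat=n)]
    return [counts[w] / len(words) for w in vocabulary]


def pssm_distance_transform(pssm, max_distance=2):
    """Map a PSSM (L rows x 20 columns) to 20 * 20 * max_distance features.

    For every gap d and every ordered column pair (a, b), average
    pssm[j][a] * pssm[j + d][b] over all valid positions j.
    Pairs with a == b relate a residue type to itself; pairs with
    a != b relate two different residue types.
    """
    length = len(pssm)
    if length <= max_distance:
        raise ValueError("sequence must be longer than max_distance")
    width = len(pssm[0])
    features = []
    for d in range(1, max_distance + 1):
        for a in range(width):
            for b in range(width):
                total = sum(pssm[j][a] * pssm[j + d][b] for j in range(length - d))
                features.append(total / (length - d))
    return features
```

Because both outputs have a length that is independent of the sequence length, the two vectors can be concatenated or fed to separate classifiers whose outputs are then combined, which is one plausible reading of how the two representations are used together with ensemble learning.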
Keywords/Search Tags: DNA-binding protein, Top-n-gram method, ensemble learning, PSSM