Prediction Of Protein Secondary Structure And Interaction Based On Machine Learning

Posted on: 2008-12-13
Degree: Doctor
Type: Dissertation
Country: China
Candidate: M H Li
GTID: 1100360245497391
Subject: Computer application technology
Abstract/Summary:
The 21st century is the era of information technology and biotechnology. Bioinformatics, which combines the two, uses computer science techniques to address biological problems. With the completion of the Human Genome Project, biology has entered the post-genomic era, marked by functional genomics. As an important research branch of this era, proteomics plays a central role, since all kinds of biological processes and events depend on proteins or their interactions.

With advances in protein sequencing technology, X-ray crystallography, and functional analysis methods, a huge amount of protein sequence, structure, and function data has been generated. This creates a new opportunity to use data-driven techniques, such as machine learning and data mining, to predict protein structure and function automatically. The main content of this thesis consists of the following four parts.

First, a pattern dictionary is constructed for each organism using a pattern discovery algorithm. Secondary structure is treated as a kind of semantic information and assigned to each "word" in each pattern dictionary. A new prediction method based on the protein pattern dictionary (DBP) is proposed for the protein secondary structure prediction (PSSP) problem; it uses a hidden Markov model to find the best secondary structure sequence for each protein sequence. On the modified Segment Overlap (SOV) measure, the DBP method achieves better prediction performance than traditional prediction methods based on single residues.

Second, there is no standard training dataset for protein-protein interaction (PPI) prediction, and training data extracted from PPI databases contain many false positives and false negatives. To address this problem, we adopt the PPI reliability of von Mering for positive samples and assign each negative sample a reliability based on subcellular information. Each training sample is then weighted by its reliability, and this weighting is applied to existing PPI prediction methods, namely the Attraction-Repulsion (AR) model and the Maximum Likelihood Estimation (MLE) method, giving a weighted AR model and a weighted MLE method. The weighting produces more accurate parameter estimates, which here are the predicted probabilities of protein domain interactions. The prediction methods based on sample reliability analysis achieve higher Receiver Operating Characteristic (ROC) scores than the original methods.

Third, few labelled PPI data are currently available, whereas unlabelled data are plentiful. A self-learning method can learn from labelled data together with a large amount of unlabelled data. Through iterative learning, more and more potential PPI data are obtained from the unlabelled data. The method needs only a small amount of initial labelled data and still achieves satisfactory performance, which makes it valuable for the PPI prediction problem. Compared with supervised learning, which uses labelled data only, the self-learning method achieves better performance by using both labelled and unlabelled data.

Fourth, traditional methods treat PPI site prediction as a residue classification problem: the class label of each residue is determined separately, without considering the labels of its neighbouring residues. In fact, the class labels of sequentially or spatially neighbouring residues are correlated. We propose a PPI site prediction method based on conditional random fields (CRF) that exploits this association between neighbouring residues. Given a protein, sequence segments of surface residues are extracted and labelled by the CRF as a whole. The CRF-based PPI site prediction method is robust and achieves better performance than traditional classification methods.
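The iterative self-learning idea from the third part can be illustrated with a minimal sketch. It is not the thesis's actual algorithm: the nearest-centroid classifier, the `margin` confidence threshold, the two-dimensional feature vectors, and the names `centroid` and `self_train` are all assumptions made for illustration. The sketch only shows the general loop: train on the labelled set, label the unlabelled samples the model is confident about, add them to the labelled set, and repeat.

```python
from math import dist


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def self_train(labelled, unlabelled, margin=1.0, max_rounds=10):
    """Illustrative self-training loop (hypothetical, not the thesis's method).

    labelled:   list of (features, label) pairs, label 1 = interacting, 0 = not;
                must contain at least one sample of each class.
    unlabelled: list of feature tuples.
    Returns the grown labelled list and the samples that stayed unlabelled.
    """
    labelled = list(labelled)
    pool = list(unlabelled)
    for _ in range(max_rounds):
        # "Train": one centroid per class from the current labelled set.
        pos = centroid([x for x, y in labelled if y == 1])
        neg = centroid([x for x, y in labelled if y == 0])
        confident, rest = [], []
        for x in pool:
            # gap > 0 means x is closer to the positive centroid.
            gap = dist(x, neg) - dist(x, pos)
            if abs(gap) >= margin:
                confident.append((x, 1 if gap > 0 else 0))
            else:
                rest.append(x)
        if not confident:
            break  # nothing new was learned in this round
        labelled.extend(confident)
        pool = rest
    return labelled, pool


# Toy data: two seed samples and four unlabelled samples (made-up values).
seeds = [((0.0, 0.0), 0), ((4.0, 4.0), 1)]
candidates = [(0.5, 0.2), (3.8, 4.1), (1.5, 1.5), (2.0, 2.0)]
grown, leftover = self_train(seeds, candidates)
```

In this toy run the ambiguous sample `(2.0, 2.0)` stays unlabelled, because it never clears the confidence margin. A real PPI predictor would replace the centroid rule with a proper classifier and a calibrated confidence score.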
Keywords/Search Tags: Machine learning, Protein secondary structure prediction, Protein-protein interaction prediction, Protein-protein interaction site prediction