Study On Protein-Nucleotide Binding Site Prediction Based On Sequence

Posted on:2016-05-03

Degree:Master

Type:Thesis

Country:China

Candidate:D H Shi

Full Text:PDF

GTID:2270330461479211

Subject:Computer application technology

Abstract/Summary:

Recent research has shown that interactions between proteins and nucleotides are closely related to human diseases, and the residues at which a protein binds a nucleotide are often important targets for drug design. Predicting protein-nucleotide binding residues is therefore of great significance. Identifying binding residues through biological experiments alone is costly and time-consuming, so predictive methods based on pattern recognition are becoming increasingly important. Protein-nucleotide binding residue prediction is also a typical imbalanced learning problem: within a sequence, the minority class (binding residues) is far smaller than the majority class (non-binding residues). This thesis addresses the imbalance with sampling methods. The main work is as follows:

(1) Feature extraction from proteins is studied. The Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST) is used to obtain the original position-specific scoring matrix (PSSM), which is then normalized with a sigmoid function. A sliding window is used to extract the neighborhood of each residue in the protein sequence as that residue's feature. Because these neighborhood features can be represented as an image, sparse representation, a method from digital image processing, is applied to extract high-quality PSSM features.

(2) Weighted under-sampling and clustering-based under-sampling are studied. Both sampling methods are used to address the imbalanced learning problem. Weighted under-sampling uses K-nearest neighbors to compute a scoring matrix over the training samples, and each sample's weight is calculated from that matrix. According to these weights, it selects as many negative samples as there are positive samples, and the selected negative samples are combined with all positive samples to form the new training set. Clustering-based under-sampling first uses the C-means algorithm to cluster all negative training samples, where C equals the ratio of the number of negative samples to the number of positive samples. It then randomly selects a certain percentage of the samples in each cluster, and the selected negative samples are combined with all positive samples to form the new training set.

(3) The WUS-SVM and CUS-SVM prediction models are studied. WUS-SVM combines weighted under-sampling with a support vector machine, while CUS-SVM combines clustering-based under-sampling with a support vector machine. Five-fold cross-validation and independent tests are performed on two benchmark datasets (NsitePred and BioLiP) to evaluate the two models. The experimental results show that both models, each using a different sampling method to handle the imbalanced learning problem, improve prediction performance to a certain extent.
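
The feature extraction in (1) and the clustering-based under-sampling in (2) can be illustrated with a minimal Python sketch. It is not the thesis implementation, and several details are assumptions because the abstract does not give them: the sigmoid is taken to be the standard logistic function, positions outside the sequence are zero-padded, "C-means" is read as hard C-means (plain k-means), and the window size and per-cluster fraction are hypothetical defaults. All function names are invented for illustration. The sparse-representation step and the weighted under-sampling scores are not sketched, since the abstract does not describe them in enough detail.

```python
import math
import random


def sigmoid_normalize(pssm):
    """Map every raw PSSM score into (0, 1) with the logistic function."""
    return [[1.0 / (1.0 + math.exp(-s)) for s in row] for row in pssm]


def window_features(pssm, window=5):
    """Build one feature vector per residue from its neighbouring PSSM rows.

    Positions outside the sequence are zero-padded (an assumption).
    """
    half = window // 2
    pad = [0.0] * len(pssm[0])
    features = []
    for i in range(len(pssm)):
        vec = []
        for j in range(i - half, i + half + 1):
            vec.extend(pssm[j] if 0 <= j < len(pssm) else pad)
        features.append(vec)
    return features


def kmeans_labels(points, k, iters=20, seed=0):
    """Plain Lloyd k-means; returns the cluster index of each point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for idx, p in enumerate(points):
            labels[idx] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_undersample(positives, negatives, fraction=0.5, seed=0):
    """Keep a share of the negatives from each cluster, plus all positives.

    `fraction` is a hypothetical default; the abstract only says
    "a certain percentage".
    """
    k = max(1, len(negatives) // len(positives))  # C = negative/positive ratio
    labels = kmeans_labels(negatives, k, seed=seed)
    rng = random.Random(seed)
    chosen = []
    for c in range(k):
        members = [p for p, lab in zip(negatives, labels) if lab == c]
        if members:
            take = min(len(members), max(1, round(fraction * len(members))))
            chosen.extend(rng.sample(members, take))
    return positives + chosen
```

With a real PSSM of 20 columns per residue, `window_features(sigmoid_normalize(pssm), window=17)` would give one vector of 17 × 20 values per residue; the window size 17 is only an example. The output of `cluster_undersample` would then serve as the training set for the support vector machine in a CUS-SVM-style model.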

Keywords/Search Tags:

Protein-nucleotide binding residues prediction, Weighted under-sampling, Clustering-based under-sampling, Support vector machine

Related items

1	Sequence-based Prediction For The Protein-peptide Binding Residues
2	Prediction Of Protein-ATP Binding Sites Based On Support Vector Regression Integration
3	Predicting Protein-Protein Interactions And Its Active Sites Based On Data Mining Algorithm
4	In Silico Prediction Of DNA-binding Residues In DNA-binding Proteins
5	The Evolutionary Conservation-based Analysis And Prediction For DNA-binding Residues
6	Identification Of Calcium-binding Residues In Proteins Based On Sequence Information
7	Predicting Protein-Protein Interactions Based On Support Vector Machine And Complete Protein Sequence
8	Intelligent Prediction Of Protein Secondary Structure Based On Fuzzy Support Vector Machine
9	Characteristic Analysis And Prediction Of Protein-protein Interactions And Protein Interaction Sites
10	A Sequential Sampling Method Based On Support Vector Regression