Study on Protein-Nucleotide Binding Site Prediction Based on Sequence

Posted on: 2016-05-03
Degree: Master
Type: Thesis
Country: China
Candidate: D H Shi
Full Text: PDF
GTID: 2270330461479211
Subject: Computer application technology
Abstract/Summary:
Recent research has shown that interactions between proteins and nucleotides are closely related to human diseases. Protein-nucleotide binding residues are often important targets for drug design, so predicting them is of great significance. However, identifying binding residues solely through biological experiments is costly and time-consuming, which makes predictive methods based on pattern recognition increasingly important. Protein-nucleotide binding residue prediction is also a typical imbalanced learning problem, because the minority class (binding residues) is far smaller than the majority class (non-binding residues) across a sequence. Sampling methods are used to address this imbalance. This thesis studies protein-nucleotide binding residue prediction in depth. The main work is as follows:

(1) Feature extraction for proteins is studied. The Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST) is used to obtain the original position-specific scoring matrix (PSSM), which is then normalized with a sigmoid function. A sliding window is used to extract the neighborhood features of each residue in the protein sequence as that residue's feature. Because these neighborhood features can be represented as an image, sparse representation, a method from digital image processing, is applied to extract high-quality PSSM features.

(2) Weighted under-sampling and clustering-based under-sampling are studied. Both sampling methods are used to address the imbalanced learning problem. Weighted under-sampling uses K-nearest neighbors to compute a scoring matrix over the training samples, and each sample's weight is calculated from that scoring matrix. It then selects as many negative samples as there are positive samples according to these weights, and combines the selected negative samples with all positive samples to form a new training set. Clustering-based under-sampling first uses the C-means algorithm to cluster all negative training samples, where C equals the ratio of the number of negative samples to the number of positive samples. It then randomly selects a certain percentage of samples from each cluster, and combines the selected negative samples with all positive samples to form a new training set.

(3) The WUS-SVM and CUS-SVM prediction models are studied. WUS-SVM combines weighted under-sampling with a support vector machine, while CUS-SVM combines clustering-based under-sampling with a support vector machine. Five-fold cross-validation and independent test experiments are performed on two benchmark datasets (NsitePred and BioLip) to evaluate the two models. The experimental results show that both models, each using a different sampling method to address the imbalanced learning problem, improve prediction performance to a certain extent.
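
The feature extraction step in item (1) can be illustrated with a minimal sketch. It assumes the PSSM is already available as an L x 20 array of raw scores, and it covers only the sigmoid normalization and the sliding window; the sparse representation step is not shown. The function names, the window size of 17, and the zero-padding at the sequence ends are assumptions made for illustration and are not taken from the thesis.

```python
import numpy as np

def normalize_pssm(pssm):
    """Map raw PSSM scores into (0, 1) with the logistic sigmoid."""
    pssm = np.asarray(pssm, dtype=float)
    return 1.0 / (1.0 + np.exp(-pssm))

def window_features(pssm, window=17):
    """Return one flattened (window x 20) patch per residue.

    Assumption: positions beyond either end of the sequence are zero-padded.
    The window size must be odd so the target residue sits in the center.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    pssm = np.asarray(pssm, dtype=float)
    half = window // 2
    # Pad rows (sequence positions) only; the 20 columns are left as they are.
    padded = np.pad(pssm, ((half, half), (0, 0)), mode="constant")
    length = pssm.shape[0]
    # Each patch is the 2-D "image" of a residue's neighborhood.
    patches = np.stack([padded[i:i + window] for i in range(length)])
    return patches.reshape(length, -1)
```

Each row of the result is the flattened neighborhood patch for one residue, so a sequence of length L with a window of 17 yields an L x 340 feature matrix.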
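
The clustering-based under-sampling in item (2) can be sketched in the same spirit. The abstract does not say which C-means variant is used, so this sketch swaps in plain Lloyd's k-means with the number of clusters set to the negative-to-positive ratio. The function names, the `fraction` parameter (the "certain percentage" drawn from each cluster), and the fixed iteration count are hypothetical.

```python
import numpy as np

def kmeans_labels(x, k, rng, n_iter=50):
    """Plain Lloyd's k-means; returns a cluster label for every row of x."""
    # Fancy indexing copies the rows, so updating centers never alters x.
    centers = x[rng.choice(len(x), size=k, replace=False)]
    labels = np.zeros(len(x), dtype=int)
    for _ in range(n_iter):
        # Squared Euclidean distance from every sample to every center.
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels

def cluster_undersample(x_pos, x_neg, fraction, seed=0):
    """Cluster the negatives, draw `fraction` of each cluster, add all positives."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    x_pos = np.asarray(x_pos, dtype=float)
    x_neg = np.asarray(x_neg, dtype=float)
    rng = np.random.default_rng(seed)
    # Number of clusters = ratio of negative to positive samples.
    k = max(1, len(x_neg) // len(x_pos))
    labels = kmeans_labels(x_neg, k, rng)
    keep = []
    for j in range(k):
        idx = np.flatnonzero(labels == j)
        if len(idx) == 0:
            continue
        n_keep = max(1, int(round(fraction * len(idx))))
        keep.extend(rng.choice(idx, size=n_keep, replace=False))
    x_neg_sel = x_neg[np.array(keep)]
    x = np.vstack([x_pos, x_neg_sel])
    y = np.concatenate([np.ones(len(x_pos), dtype=int),
                        np.zeros(len(x_neg_sel), dtype=int)])
    return x, y
```

Setting `fraction` close to the inverse of the class ratio gives a roughly balanced training set, which could then be passed to any support vector machine implementation. The distance computation builds an n x k x d array, so this form is only suitable for small illustrative inputs.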
Keywords/Search Tags: Protein-nucleotide binding residues prediction, Weighted under-sampling, Clustering-based under-sampling, Support vector machine