
Machine Learning And Statistical Based Methods For Protein Structural Features Prediction

Posted on: 2014-12-08
Degree: Doctor
Type: Dissertation
Country: China
Candidate: R T Murtada Khala
GTID: 1260330401456224
Subject: Computer Science and Technology
Abstract/Summary:
Proteins are an important class of biological macromolecules present in all organisms, and they play a key role in almost all biological processes. A protein consists of amino acids connected in a sequence that forms its primary structure. The basic elements of protein secondary structure are alpha-helices, beta-sheets, coils, and turns. A turn is a structural motif in which the alpha-carbon atoms of two residues are separated by a few (usually 1 to 5) peptide bonds and lie less than 7 Å apart, while the corresponding residues do not form a regular secondary structure element such as an alpha-helix or beta-sheet. Turns are classified according to the separation between the two end residues: four peptide bonds in alpha-turns, three in beta-turns, two in gamma-turns, one in delta-turns, and five in pi-turns. Beta-turns are the most common type of turn in proteins: approximately 25% of amino acids in protein structures are located in them. Developing accurate beta-turn prediction methods is therefore valuable. Several methods have been designed for beta-turn prediction, but prediction quality remains a challenge and there is substantial room for improvement.

In this thesis we study the integration of machine learning and statistical methods into protein secondary structure and beta-turn prediction. We use a statistical dimensionality reduction method with an Artificial Neural Network (ANN) to increase its efficiency in protein secondary structure prediction and produce results comparable to state-of-the-art methods. We also formulate a Logistic Regression (LR) model and apply Kernel Logistic Regression (KLR) to protein beta-turn prediction. Both techniques are rarely used in the domain of protein secondary structure and beta-turn prediction.
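The turn definition given earlier in this abstract can be written down as a small lookup. The following is a minimal sketch, not code from the thesis: the function name `classify_turn` and the `in_regular_structure` flag are hypothetical, and how regular secondary structure is detected is outside its scope.

```python
# Separation (number of peptide bonds between the two end residues) mapped
# to the turn type, as listed in the abstract.
TURN_TYPES = {1: "delta", 2: "gamma", 3: "beta", 4: "alpha", 5: "pi"}

MAX_CA_DISTANCE = 7.0  # angstroms; threshold stated in the abstract


def classify_turn(separation, ca_distance, in_regular_structure=False):
    """Return the turn type name, or None if the residue pair is not a turn.

    separation: number of peptide bonds between the two end residues.
    ca_distance: distance between their alpha-carbon atoms, in angstroms.
    in_regular_structure: True if the residues form a helix or sheet
        (hypothetical flag; this sketch does not compute it).
    """
    if in_regular_structure:
        return None
    if ca_distance >= MAX_CA_DISTANCE:
        return None
    # Separations outside 1..5 are not covered by the classification.
    return TURN_TYPES.get(separation)
```

For example, `classify_turn(3, 5.5)` falls in the beta-turn class, while the same separation with a distance of 7.5 Å is not a turn under this definition.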
Finally, we provide a hybrid approach that combines Support Vector Machines (SVMs) and LR into a single framework, which is found to work well for protein beta-turn prediction.

Training an ANN is time-consuming, especially when the number of features is very large. We therefore start by combining the ANN with Principal Component Analysis (PCA), a mathematical procedure that transforms correlated variables into ordered, uncorrelated variables, for protein secondary structure prediction. We show that PCA can reduce the computational overhead of an ANN trained with the Scaled Conjugate Gradient (SCG) algorithm. The Conjugate Gradient (CG) algorithm is a search method that minimizes the network output error along conjugate directions. The ANN is trained to recognize amino acid patterns located in known secondary structures and to distinguish them from patterns not located in these structures. The input layer of the ANN encodes a moving window over the amino acid sequence, and the prediction is made for the central residue of the window. Single-sequence information is used as input features: the amino acid at each window position is encoded by a vector of 20 inputs, one for each possible amino acid type. In each vector, the input corresponding to the amino acid type at that window position is set to 1 and all other inputs are set to 0. Position-Specific Scoring Matrices (PSSMs) are also considered as input features. In a PSSM, each row corresponds to one residue of the sequence. For a given window size, the input vector for each window position is taken from the PSSM row that corresponds to the amino acid at that position.

Second, we present LR and KLR methods for beta-turn prediction. We start with an LR model using different feature sets.
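The single-sequence window encoding described above can be sketched as follows. This is an illustration under stated assumptions, not the thesis implementation: the window size of 13, the residue ordering, the function names, and the all-zero encoding for positions beyond the sequence ends are all choices made here, since the abstract does not specify them.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode_window(sequence, center, window_size=13):
    """One-hot encode the window centered on position `center`.

    Returns a flat list of window_size * 20 inputs. Positions that fall
    off either end of the sequence, and unknown residue codes, are encoded
    as all zeros (an assumption of this sketch).
    """
    half = window_size // 2  # window_size is expected to be odd
    features = []
    for pos in range(center - half, center + half + 1):
        one_hot = [0] * len(AMINO_ACIDS)
        if 0 <= pos < len(sequence):
            idx = AA_INDEX.get(sequence[pos])
            if idx is not None:
                one_hot[idx] = 1
        features.extend(one_hot)
    return features


def encode_sequence(sequence, window_size=13):
    """Build one input vector per residue, each predicting the central residue."""
    return [encode_window(sequence, i, window_size) for i in range(len(sequence))]
```

A PSSM-based variant would follow the same windowing but copy the 20 scores of the corresponding PSSM row instead of the one-hot vector. Vectors of this kind are what a dimensionality reduction step such as PCA would compress before training.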
We then use KLR, which is rarely applied to protein secondary structure and beta-turn prediction because of its computational demands. However, unlike SVMs and ANNs, KLR yields a posteriori probabilities based on a maximum likelihood argument; that is, besides predicting class labels, KLR provides an interpretation of the labeling. We show that Fixed-Size KLR (FS-KLR), a fast implementation of KLR suited to large datasets, can predict beta-turns efficiently and effectively, and that it yields results comparable to state-of-the-art methods.

Finally, we propose a hybrid approach that combines SVMs and LR for beta-turn prediction. We use PSSMs and Predicted Secondary Structure (PSS) as features. The k-means clustering algorithm divides the non-beta-turn examples into three subsets, and each subset is combined with the beta-turn class to create a sub-training set. Three SVM classifiers are used, one per sub-training set. The SVM outputs are aggregated by an LR model, which lets us draw on statistical modeling theory to find the optimal weight for each SVM. Fractional polynomials, which are power terms whose exponents can take positive and negative integer values as well as fractional values chosen to best fit the data, are used to select the final LR model. This hybrid approach avoids the difficulty of imbalanced data and produces probabilistic outputs. Our simulation studies show that it achieves the best performance among beta-turn prediction methods based on PSSMs and secondary structure information. It also achieves good performance when shape strings are used as additional features.
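The data-partitioning step of the hybrid approach can be sketched as below. This covers only the k-means split of the non-beta-turn class and the construction of the three sub-training sets; the SVM training and the LR aggregation are not shown. The plain Lloyd's k-means, the function names, the random seed, and the iteration count are assumptions of this sketch rather than details from the thesis.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns a cluster index for each point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids drawn from the data
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def build_sub_training_sets(turns, non_turns, k=3, seed=0):
    """Pair every beta-turn example with one cluster of non-turn examples."""
    labels = kmeans(non_turns, k, seed=seed)
    subsets = []
    for c in range(k):
        negatives = [p for p, lab in zip(non_turns, labels) if lab == c]
        X = turns + negatives
        y = [1] * len(turns) + [0] * len(negatives)
        subsets.append((X, y))
    return subsets
```

Each `(X, y)` pair would train one SVM, so every classifier sees all beta-turn examples but only a fraction of the much larger non-beta-turn class. The three SVM outputs would then serve as the inputs of the LR model that produces the final probability.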
Keywords/Search Tags: protein secondary structure, beta-turns, artificial neural networks, logistic regression, kernel logistic regression, support vector machines, fractional polynomials