
Amino Acid Sequence Characterization, Features Selection And Its Application

Posted on: 2015-10-07
Degree: Master
Type: Thesis
Country: China
Candidate: Y Li
GTID: 2180330470452226
Subject: Bioinformatics
Abstract/Summary:
Classification and function identification of peptides and proteins is an important task in the post-genome era. Compared with traditional, time-consuming experimental methods, machine learning is an efficient alternative. Machine learning builds on the known information of the samples in a given dataset, and it involves three key steps: feature acquisition, feature selection and model building. It is generally accepted that the amino acid sequence determines the spatial structure and function of a peptide or protein. The spatial structure is very difficult to determine, but the primary structure (i.e. the amino acid sequence) is easy to obtain. We characterized each amino acid sequence with a combination of amino acid composition, geostatistical association, κ-space features, etc. (feature acquisition). We removed irrelevant and redundant features with our nonlinear feature screening workflow, which is based on ridge regression, the binary matrix shuffling filter, multi-round worst-descriptor elimination, etc. (feature selection). We used the support vector machine (SVM) as the basic modeling tool (model building). SVM offers several modeling advantages, including structural risk minimization, suitability for small samples, and resistance to over-fitting. As applications of these machine learning approaches, we studied protein folding rate prediction, cell penetrating peptide recognition and conotoxin superfamily classification in this thesis. The results are as follows:

Protein folding rates were predicted based on ridge regression and support vector regression (SVR). 96 proteins with sequence length greater than 50 amino acids were represented with geostatistical association and κ-space features. We selected 25 and 15 features, respectively, by screening with ridge regression and then with multi-round worst-descriptor elimination. Our SVR models had correlation coefficients of 0.89 and 0.93 for the two representation methods.
The nonlinear SVR interpretation system showed that the SVR models and the selected features are all extremely significant. The folding rate of proteins longer than 50 amino acids is related to the following features: different proportions of the hydropathy scale based on self-information values in the two-state model, sequence frequency, side chain angle, relative mutability, and the frequency of amino acid pairs that contain at least one aliphatic amino acid. Glycine, alanine and leucine, and medium and long distances, have the larger influence.

Protein folding rates were also predicted based on the improved binary matrix shuffling filter and SVR. It is particularly difficult to obtain stable and efficient features for short amino acid sequences. We built a mixed dataset of 115 samples by merging 96 proteins (>= 50 aa) and 19 proteins (< 50 aa). We represented the 115 proteins with amino acid composition, geostatistical association, κ-space features, etc. We selected 23 features by screening with the improved binary matrix shuffling filter and then with multi-round worst-descriptor elimination. Our SVR model had a correlation coefficient of 0.95. The SVR interpretation system was used to analyze the significance of the model and the significance and single-factor effects of the selected features. The results showed that protein folding rate might be closely related to sequence length, medium- and short-range association features, triplet-residue component features, etc.

Recognition of cell penetrating peptides (CPPs) and non-CPPs based on sequence characteristics was studied, which is a binary classification problem. We represented 85 CPPs and non-CPPs with geostatistical correlation features based on 531 physicochemical amino acid properties from the amino acid index database. We removed irrelevant and redundant features by screening with the t-test and then with multi-round worst-descriptor elimination.
Our nonlinear support vector classification (SVC) model had an accuracy of 83.53%, which was better than the accuracies reported in the literature.

Recognition of conotoxin superfamilies based on sequence characteristics was studied, which is a multi-class classification problem. We represented each sequence of conotoxin superfamilies A (63), M (48), O (95) and T (55) and of non-conotoxins (60) with pseudo amino acid composition, κ-space features, and physical and chemical properties of amino acids. We screened features with the binary matrix shuffling filter and then with multi-round worst-descriptor elimination. Our nonlinear SVC model had an accuracy of 92.83%, which was superior to previously reported accuracies. The results can be further used to guide the discovery of new conotoxins.
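
As an illustration of the feature acquisition and t-test screening steps described above, the following is a minimal sketch in Python, not the implementation used in the thesis. It computes the amino acid composition of a sequence and ranks features by the absolute value of a t statistic between two classes. The function names, the choice of Welch's variant of the t-test, and the toy sequences are assumptions made for illustration only.

```python
from math import copysign, inf, sqrt
from statistics import mean, variance

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return the fraction of each standard residue, in AMINO_ACIDS order."""
    seq = sequence.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)  # non-standard symbols are ignored
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [c / total for c in counts]


def welch_t(a, b):
    """Welch's t statistic for one feature measured in two classes.

    Assumption: the source only says "t-test"; Welch's form is used here.
    Each sample needs at least two values.
    """
    diff = mean(a) - mean(b)
    denom = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if denom == 0:
        # Both samples are constant: no separation, or perfect separation.
        return 0.0 if diff == 0 else copysign(inf, diff)
    return diff / denom


def rank_features(pos_rows, neg_rows):
    """Return feature indices sorted by |t|, most discriminative first."""
    n_features = len(pos_rows[0])
    scores = [
        abs(welch_t([row[j] for row in pos_rows], [row[j] for row in neg_rows]))
        for j in range(n_features)
    ]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


# Toy sequences invented for illustration; they are not data from the thesis.
positives = ["RRKKRRAL", "KKRRGLRK", "RKRKRKGA"]
negatives = ["DDEEGALS", "EEDSGLAT", "DESTGALD"]
pos_features = [amino_acid_composition(s) for s in positives]
neg_features = [amino_acid_composition(s) for s in negatives]
top = rank_features(pos_features, neg_features)[:5]
print([AMINO_ACIDS[j] for j in top])
```

In a fuller workflow, the top-ranked features would then be passed to further screening and to an SVM; that part is omitted here because the abstract does not give enough detail to reproduce it.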
Keywords/Search Tags: Support vector machine, Amino acid sequence characteristics, Feature screening, Protein folding rate, Cell penetrating peptides, Conotoxin