| Structural characterization is crucial to quantitative structure-activity relationship (QSAR) studies of peptides and proteins. Most of the structural and functional information of peptides and proteins is contained in their amino acid sequences, so the characterization of amino acid residues is of great significance to such QSAR studies. Two kinds of amino acid descriptors were extracted by principal component analysis (PCA): the principal component score vector of hydrophobic, electronic, steric, and hydrogen bond properties (VHESH) and the principal component score vector of structural and topological variables (VSTPV). VHESH was derived by applying PCA separately to four independent families of properties of the 20 coded amino acids: 50 hydrophobic, 23 electronic, 35 steric, and 5 hydrogen bond properties, 113 physicochemical properties in total. For each amino acid, VHESH1 and VHESH2 describe hydrophobic properties, VHESH3 to VHESH6 electronic properties, VHESH7 and VHESH8 steric properties, and VHESH9 and VHESH10 hydrogen bond properties. VSTPV was derived from PCA of 85 structural and topological variables of 166 coded and non-coded amino acids. VHESH is physicochemically interpretable and more informative than z-scales and other amino acid descriptors, while VSTPV is easy to compute, independent of experimental measurements, and easily extended to other non-coded amino acids. VHESH and VSTPV were applied to the structural description of several functional peptides, including angiotensin-converting enzyme inhibitors, oxytocin analogues, decapeptides binding to the SH3 domain of human protein Amphiphysin-1, cationic antimicrobial peptides, and cell-penetrating peptides. Robust and predictive QSAR models were obtained with various modeling techniques and methods.
The VHESH models showed that the bioactivities of angiotensin-converting enzyme inhibitors could be enhanced by increasing, among other factors, the electronic and hydrophobic properties of the 2nd residue and the steric properties of the 1st residue, whereas their activities might be decreased by increasing the electronic properties of the 1st residue. It was inferred that the activities of oxytocin analogues might be strongly positively correlated with the electronic and hydrophobic properties of the 1st residue and with the steric and hydrogen bond properties of the 3rd residue, and strongly negatively correlated with the hydrophobic, electronic and steric properties of the 2nd residue. Diversified properties of the residues between the P-3 site and the P2 site of the decapeptide (P4P3P2P1P0P-1P-2P-3P-4P-5) may contribute remarkably to the interactions between the human Amphiphysin-1 SH3 domain and the decapeptide. For antimicrobial peptides, the electronic properties of the 3rd residue, the steric properties of the 6th, 7th and 12th residues, and the hydrophobic properties of the 11th and 12th residues exert strongly positive effects on activity, whereas the electronic properties of the 6th, 10th and 12th residues contribute negatively to antibacterial activity. Different kinds of structural information about cell-penetrating peptides may be highly correlated with the penetration process. Many new peptide sequences can be designed on the basis of the structure-activity relationships found in these peptide panels. The VSTPV models gave results similar to those of the VHESH models in explaining the relationships between sequence positions and bioactivities. VSTPV was also applied to the structural description of several peptides and analogues, including bradykinin-potentiating pentapeptides, bovine lactoferricin-(17–31)-pentadecapeptide, and elastase substrate analogues. Robust and predictive QSAR models were developed using various modeling techniques and methods.
The results showed that the activities of bradykinin-potentiating pentapeptides were mainly related to the topological information of the 2nd and 3rd residues. The 6th and 8th topological variables contribute significantly to the bioactivities of bovine lactoferricin-(17–31)-pentadecapeptide. The squared and reciprocal terms of the topological variables of residues A and B mainly affect the catalytic activities of elastase substrate analogues. The principles and methodologies of QSAR were also employed to investigate the relationship between protein structure and property or function. VHESH and VSTPV were applied to characterize amino acid sequences of proteins, including cleavage sites of HIV-1 protease (HIV PR), protein phosphorylation sites, and RNA-binding sites in proteins. It was inferred that HIV PR may recognize only several key properties of various sites in the octameric sequences. These diversified properties, including the steric, hydrogen bond, electronic, hydrophobic and topological properties of the 1st, 2nd, 4th, 5th and 6th residues, among others, may be important in determining HIV PR cleavage. The physicochemical properties (VHESH) and topological properties (VSTPV) of the P-3 site near the S, T, and Y sites were significant for predicting phosphorylated S, T and Y sites. In the 11-residue motifs of protein sequences, the steric, hydrophobic, electronic and topological properties of the 2nd, 5th and 6th sites had a marked influence, whereas the other sites had little influence. This suggests that these properties may be key features for recognition of the RNA-binding region. The modeling methods and related techniques are also important to the success of QSAR studies. Modeling and pattern recognition methods, such as multiple linear regression (MLR), partial least squares (PLS), linear discriminant analysis (LDA) and support vector machine (SVM), were discussed in this dissertation.
The results showed that MLR performed as well as the other modeling methods when its application conditions were met. PLS avoids the harmful effects of multicollinearity in modeling and is particularly suited to regression when the sample size is smaller than the number of variables. LDA yields robust and interpretable models. As a relatively new machine learning algorithm, SVM copes well with small datasets, nonlinear problems, high-dimensional feature spaces, and local minima, among other difficulties. In addition, stepwise multiple regression (SMR) and the genetic algorithm (GA) were used to optimize variable subsets. The results indicated that variable selection can effectively remove noise from the original variable set. The QSAR models were then subjected to validation and evaluation. In this dissertation, each dataset was first divided into a training set and a test set. The training set was used to establish the QSAR models. Leave-one-out (LOO) cross validation (CV), leave-1/n-out (LNO) CV, leave-many-out (LMO) CV and the Y random permutation test were used for internal validation of the QSAR models. Following internal validation, external validation was performed on the test set. Several evaluation functions were used to assess the predictive power of the QSAR models. In addition, the errors in the predicted activities of designed molecules were assessed using the applicability domain of the models. |
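
The internal validation steps described above (LOO cross validation and the Y random permutation test) can be illustrated with a minimal sketch. It assumes a plain MLR model fitted by ordinary least squares and uses synthetic descriptors rather than VHESH or VSTPV scores; the function names (`fit_mlr`, `loo_q2`, `y_randomization`) and all data are hypothetical and are not taken from the dissertation.

```python
import numpy as np

def fit_mlr(X, y):
    """Fit ordinary least squares with an intercept; return [b0, b1, ...]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_mlr(coef, X):
    """Predict activities from descriptors with fitted coefficients."""
    return coef[0] + X @ coef[1:]

def loo_q2(X, y):
    """Leave-one-out cross-validated q^2 = 1 - PRESS / SS_total.

    SS_total is taken around the mean of the full response vector,
    one common convention (an assumption of this sketch).
    """
    n = len(y)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i               # drop sample i
        coef = fit_mlr(X[keep], y[keep])       # refit without it
        pred = predict_mlr(coef, X[i:i + 1])[0]
        press += (y[i] - pred) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def y_randomization(X, y, n_perm=50, seed=0):
    """Recompute q^2 after shuffling y; a sound model should beat these."""
    rng = np.random.default_rng(seed)
    return np.array([loo_q2(X, rng.permutation(y)) for _ in range(n_perm)])

# Synthetic example: 30 "peptides", 3 descriptors, nearly linear activity.
rng = np.random.default_rng(42)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=30)

q2 = loo_q2(X, y)
q2_perm = y_randomization(X, y)
print(f"LOO q2 = {q2:.3f}; best permuted q2 = {q2_perm.max():.3f}")
```

If the q² of the real model is not clearly higher than the q² values obtained with permuted activities, the apparent fit is likely due to chance correlation. The same loop structure carries over to LNO or LMO validation by leaving out groups of samples instead of single ones.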