Construction Of Representation Techniques And Investigation On Structure-Activity Relationship For Biological Sequences

Posted on: 2008-01-17
Degree: Doctor
Type: Dissertation
Country: China
Candidate: G Z Liang
GTID: 1100360215490023
Subject: Biomedical engineering
Abstract/Summary:
Representation of biological sequences (peptides, proteins and nucleic acids) is crucial for investigating their structure-activity relationships. Structural descriptors for biological sequences should capture the structural information most closely related to their activities, since this determines whether a structure-activity study succeeds. The structures that govern the activities of biological sequences are determined by the information contained in their primary sequences. Investigating the characteristics of primary sequences is therefore of great significance for studying their structure-activity relationships. Taking into account the diverse properties and activities of biological sequences, two representation techniques were constructed in this dissertation: ① factor analysis scales of generalized amino acid information (FASGAI), derived from 516 property parameters of the 20 coded amino acids; ② scores of generalized base properties (SGBP), derived from principal component analysis of a matrix of 1209 property parameters. Satisfactory results demonstrated that both FASGAI and SGBP vectors have distinct advantages such as straightforward physicochemical meaning, high characterization ability, convenient extensibility and easy manipulation.
FASGAI vectors were applied to represent the structures of several classes of functional peptides, including bitter-tasting dipeptides, angiotensin-converting enzyme inhibitors, cationic antimicrobial peptides, octapeptides cleaved by HIV-1 protease (HIV PR), HLA-A*0201-restricted T-cell epitopes and decapeptides binding to the SH3 domain of human Amphiphysin-1. Favorable quantitative structure-activity relationship (QSAR) models were then developed using various modeling techniques and methods.
The results showed that the activities of bitter-tasting dipeptides may be strongly positively correlated with, among others, the bulky properties of the 1st residue and the bulky properties and hydrophobicity of the 2nd residue, and strongly negatively correlated with, among others, the α-helix and turn propensities of the 2nd residue. For angiotensin-converting enzyme inhibitors, the structural information related to activity suggests that increasing the bulky properties and hydrophobicity of the 2nd residue, the electronic properties of the 1st residue and so on may enhance activity, whereas increasing the compositional characteristics of the 2nd residue and so on may reduce it. For antimicrobial peptides, the electronic properties of the 10th residue, the bulky properties of the 7th residue, the hydrophobicity of the 12th residue and the electronic properties of the 3rd residue, among others, may have a strong positive effect on activity, whereas the hydrophobicity and compositional characteristics of the 6th residue and the hydrophobicity of the 10th residue, among others, may have a strong negative effect on antibacterial activity. HIV PR may recognize diversified key properties at various sites in the octameric sequences. These properties, including the bulky properties, secondary conformation characteristics, electronic properties and hydrophobicity of the 1st, 2nd, 4th, 5th and 6th residues, among others, may be important factors in determining whether HIV PR cleaves a sequence; in particular, the bulky properties of the corresponding sites may be key features recognized by HIV PR.
Investigation of the properties most closely related to the affinities of HLA-A*0201-restricted T-cell epitopes demonstrated that the bulky properties and hydrophobicity of the 3rd residue, the bulky properties of the 2nd residue and the hydrophobicity of the 9th residue, among others, may make the largest positive contributions to affinity, whereas the hydrophobicity of the 4th residue and the local flexibility of the 3rd residue, among others, may make the largest negative contributions. For the decapeptide (P-5 P-4 P-3 P-2 P-1 P0 P1 P2 P3 P4), diversified properties of the residues from the P-3 site to the P2 site inclusive may contribute markedly to the interactions between the human Amphiphysin-1 SH3 domain and the decapeptide. In particular, the electronic properties of the P-3 residue may make a large positive contribution to the interactions, and the hydrophobicity of the P-3 residue may make a large negative contribution.
Original prediction techniques independent of sequence homology and structural similarity were developed to predict structure-activity relationships for proteins. FASGAI vectors were used to identify basic helix-loop-helix (bHLH) proteins, β-turns of proteins, G-protein-coupled receptors (GPCRs) and hemagglutinins of highly pathogenic avian influenza virus (HPAIV). In the motif formed by the first 13 residues of bHLH protein sequences, the property parameters of the 5th, 8th, 9th and 13th sites had a marked influence, whereas those of the 4th, 6th, 10th and 12th sites had little influence. This suggests that these properties may be key features recognized for the DNA binding region. Analysis of variance indicated that there may be significant differences in the property parameters of the 5th, 8th, 9th and 13th sites, except for the local flexibility of the 8th residue and the bulky properties of the 9th residue.
Therefore, these properties may be used to identify bHLH proteins. Satisfactory prediction results for β-turns showed that the characteristics of β-turn residues were well represented by FASGAI vectors, and some important information related to β-turn residues was also obtained. The FASGAI-ACC-SVM methodology, which combines FASGAI representation, auto cross covariance (ACC) transform and support vector machine (SVM) modeling, was used to identify GPCRs and hemagglutinins of HPAIV. The results demonstrated that FASGAI vectors are an excellent representation technique for protein sequences. The FASGAI-ACC-SVM methodology thus offers a direction for identifying GPCRs and hemagglutinins of HPAIV.
SGBP vectors were employed to predict the strengths of E. coli promoters and to identify human genome promoters. Properties at base positions -45, -38, -28, -27, -22, -21, -5, +4, +8, +14 and +15, among others, may have a marked influence on the strengths of E. coli promoters of 68 base pairs (-49 bp to +19 bp), which gives direction to the search for strong promoters. The results for prediction of human genome promoters (-250 bp to +50 bp) revealed that the SGBP-ACC-SVM methodology, which combines SGBP representation, ACC transform and SVM modeling, has wide prospects for application in predicting other promoters, transcription properties of mRNA, secondary structure of RNA and so on.
The modeling and pattern recognition methods, particularly partial least squares (PLS), linear discriminant analysis (LDA) and SVM, were investigated. Techniques for variable selection, parameter determination and model validation were also discussed in this dissertation. The results showed that PLS avoids the harmful effects of multicollinearity in modeling and is particularly suited to regression when the number of observations is smaller than the number of variables.
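The ACC transform named in the FASGAI-ACC-SVM and SGBP-ACC-SVM methodologies turns sequences of different lengths into feature vectors of one fixed length, which is what an SVM needs as input. The sketch below shows one common form of the transform and is not taken from the dissertation: the function name `acc_transform`, the choice of averaging over `n - lag` terms, and the toy descriptor values are all assumptions made for illustration. It also assumes the per-position descriptor values have already been centred and scaled.

```python
def acc_transform(scores, max_lag):
    """Auto cross covariance of a sequence of descriptor vectors.

    scores  : list of per-position descriptor vectors (n positions, d values each),
              assumed to be centred/scaled beforehand (an assumption of this sketch).
    max_lag : largest distance between the two positions being compared.
    Returns a flat list of max_lag * d * d features, independent of n.
    """
    n = len(scores)
    d = len(scores[0])
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    features = []
    for lag in range(1, max_lag + 1):
        for j in range(d):          # descriptor taken at position i
            for k in range(d):      # descriptor taken at position i + lag
                total = sum(scores[i][j] * scores[i + lag][k]
                            for i in range(n - lag))
                features.append(total / (n - lag))
    return features


# Toy input: made-up numbers, NOT real FASGAI or SGBP values.
short_seq = [[1, 0], [0, 1], [1, 1]]
long_seq = [[0.2, -1.0], [1.5, 0.3], [-0.7, 0.9], [0.1, 0.1], [-1.2, 0.4]]

print(acc_transform(short_seq, 1))        # [0.0, 0.5, 0.5, 0.5]
print(len(acc_transform(short_seq, 2)))   # 8
print(len(acc_transform(long_seq, 2)))    # 8, same length despite a longer sequence
```

With j equal to k the terms are auto covariances of one descriptor; with j different from k they are cross covariances between two descriptors, so the output length depends only on the number of descriptors and on `max_lag`.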
Models developed by LDA are robust and interpretable. As a new machine learning algorithm, SVM can deal well with small datasets, nonlinear optimization, high-dimensional feature spaces, local minima and so on. These results showed that SVM has wide prospects for application in structure-activity relationship studies of biological sequences. However, many issues, e.g., the selection of kernel functions and their parameters, remain to be studied in detail. SVM parameters were tentatively determined by response surface methodology in order to obtain reliable results, and the results demonstrated that this methodology is effective for SVM parameter determination. In addition, stepwise multiple regression, a genetic algorithm and a stepwise procedure were used to optimize variable subsets. The results indicated that all three variable selection methods can efficiently remove noise from the original variables. Self-consistency, leave-one-out and leave-group-out tests were used for internal validation. On the basis of the internal validation, external validation was performed using the prediction data set to ensure the validity of the models obtained.
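As an illustration of the leave-one-out internal validation mentioned above, the following minimal sketch computes a cross-validated q² (1 minus PRESS over the total sum of squares) for a one-variable least-squares line. The single-descriptor linear model is only a stand-in for the PLS, LDA and SVM models of the dissertation, the helper names `fit_line` and `loo_q2` are hypothetical, and the data are made-up numbers rather than results from the work.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def loo_q2(xs, ys):
    """Leave-one-out q^2: refit without sample i, then predict sample i."""
    n = len(xs)
    press = 0.0  # predictive residual sum of squares
    for i in range(n):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    mean_y = sum(ys) / n
    total_ss = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / total_ss


# Made-up descriptor/activity pairs, for illustration only.
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0]
activity = [2.1, 3.9, 6.2, 7.8, 10.1]
print(loo_q2(descriptor, activity))
```

A leave-group-out test follows the same pattern with several samples held out per round, and external validation applies the final model once to samples that were never used in fitting.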
Keywords/Search Tags: biological sequence, quantitative structure-activity relationship, structure and activity relationship, factor analysis scales of generalized amino acid information, scores of generalized base properties, peptide, protein, promoter