Feature Extraction And Deep Learning Method For Protein Inter-residue Interaction Prediction

Posted on: 2022-04-10
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Li
GTID: 1520307061473444
Subject: Control Science and Engineering
Abstract/Summary:
Proteins carry out most biological functions in living organisms, which makes them a central focus of life science research. To perform specific functions, proteins must fold into specific structures. A protein structure can be represented by the coordinates of the atoms of its constituent amino acid residues. These coordinates are usually determined by wet-lab experimental techniques, namely X-ray crystallography, nuclear magnetic resonance (NMR), and cryogenic electron microscopy (cryo-EM). However, these experimental methods are more complex and time-consuming than protein sequencing, which has resulted in a huge gap between the amount of protein sequence data and structure data. Computational prediction of protein structures has therefore become an important but unsolved problem in computational biology and biophysics. The major challenge in sequence-based protein structure prediction lies in distant-homology modeling (or ab initio structure prediction). Predicted contact maps can significantly improve the performance of protein structure prediction in several forms, e.g., as latent sequential representations for protein threading and as statistical energy terms for ab initio structure prediction. Despite this success, the precision of contact prediction is still limited by the feature extraction used in current methods. To precisely predict the spatial information between long-range residues in proteins, we have been consistently developing and improving models for inter-residue interaction prediction. The work of this thesis is summarized as follows:

(1) A new model was proposed that predicts residue-level protein contacts from the inverse covariance matrix (or precision matrix) of multiple sequence alignments (MSAs) through deep residual convolutional neural network training. The approach was tested on a set of 158 non-homologous proteins collected from the CASP experiments and achieved an average accuracy of 50.6% in Top-L long-range contact prediction, with L being the sequence length, which is 11.7% higher than the best of the other state-of-the-art approaches, ranging from coevolution coupling analysis to deep neural network training. Detailed data analyses show that the major advantage of the proposed method lies in its use of the precision matrix, which helps rule out transitional noise in contact maps compared with the previously used covariance matrix. Meanwhile, the residual network with parallel shortcut layer connections increases the learning ability of the deep neural network. It was also found that appropriate collection of MSAs can further improve the accuracy of the final contact-map predictions. This work should have an important impact on protein structure and function modeling studies, in particular for distant- and non-homology protein targets.

(2) A new model was proposed that is based on an ensemble of a triplet of coevolutionary features and deduces protein contact maps from discretized distance profiles through end-to-end training of deep residual neural networks. Compared to previous approaches, the major advantage of the method is its ability to learn from and directly fuse a triplet of coevolutionary matrices extracted from whole-genome and metagenome databases, thereby minimizing information loss during contact model training. The model, TripletRes, was tested on a large set of 245 non-homologous proteins from the CASP and CAMEO experiments and outperformed other state-of-the-art methods by at least 58.4% for the CASP 11&12 targets and 44.4% for the CAMEO targets in Top-L long-range contact precision. On the 31 FM targets from the latest CASP13 challenge, the proposed method achieved the highest precision (71.6%) for Top-L/5 long-range contact predictions. These results demonstrate a novel and efficient approach to extending the power of deep convolutional networks for high-accuracy medium- and long-range protein contact-map prediction starting from primary sequences, which is critical for constructing the 3D structures of proteins that lack homologous templates in the PDB library.

(3) A protein structure prediction method was proposed that predicts multiple inter-residue geometry descriptors by fusing multiple coevolution analysis features in a multi-task deep learning model. Additional sequence-specific and post-processed positional-level features, together with one-dimensional features, are fused through a deep multi-task learning architecture composed of residual blocks. Each geometry prediction term is modeled marginally as a histogram distribution. The discrete histograms are smoothed into differential potentials, which are then optimized by a global gradient descent algorithm. The method was evaluated on two independent datasets containing 31 CASP13 targets and 168 CAMEO targets, respectively. The Top-L precisions of the predicted contacts are 49.3% and 57.7% on the two test sets, respectively, consistently higher than those of other state-of-the-art methods. The proposed deep learning models were also evaluated with the mean absolute error, and the results again show the superiority of the proposed method. The accuracy of these predictions yields an average TM-score of 0.6 on the CASP13 free-modeling targets, higher than that of AlphaFold, the best predictor in CASP13.
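
The Top-L and Top-L/5 long-range contact precisions reported above can be illustrated with a minimal sketch. It is not the thesis's evaluation code. The function name `top_l_long_range_precision`, its parameters, and the toy data are hypothetical, and the long-range threshold of a sequence separation of at least 24 residues is an assumed convention (commonly used in CASP-style evaluations) that the abstract does not state.

```python
import numpy as np

def top_l_long_range_precision(pred, true_contacts, divisor=1, min_sep=24):
    """Fraction of true contacts among the top L/divisor long-range predictions.

    pred          : (L, L) array of predicted contact scores or probabilities.
    true_contacts : (L, L) boolean array, True where two residues are in contact.
    divisor       : 1 for Top-L, 5 for Top-L/5, and so on.
    min_sep       : minimum sequence separation |i - j| for a long-range pair
                    (24 is an assumed convention, not taken from the abstract).
    """
    pred = np.asarray(pred, dtype=float)
    true_contacts = np.asarray(true_contacts, dtype=bool)
    L = pred.shape[0]
    # Upper-triangle residue pairs (i < j) with j - i >= min_sep.
    i, j = np.triu_indices(L, k=min_sep)
    if i.size == 0:
        return float("nan")  # protein too short to have long-range pairs
    n_top = min(max(1, L // divisor), i.size)
    # Rank candidate pairs by descending predicted score and keep the top n_top.
    order = np.argsort(-pred[i, j], kind="stable")[:n_top]
    return float(true_contacts[i[order], j[order]].mean())

# Toy example: a 40-residue "protein" whose contacts sit at separations 24 to 26.
L = 40
idx = np.arange(L)
sep = np.abs(idx[None, :] - idx[:, None])
true_map = (sep >= 24) & (sep <= 26)
perfect_pred = true_map.astype(float)  # a predictor that knows the answer
print(top_l_long_range_precision(perfect_pred, true_map))             # Top-L
print(top_l_long_range_precision(perfect_pred, true_map, divisor=5))  # Top-L/5
```

Because only the highest-scoring pairs are counted, the metric rewards a predictor for ranking true contacts first rather than for calibrating every pairwise probability, which is why results are quoted separately for Top-L and Top-L/5.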
Keywords/Search Tags: protein structure prediction, coevolutionary analysis, residual neural network, statistical potential, differential protein folding