Predicting Protein Flexible Regions From Protein Sequences

Posted on: 2018-07-25
Degree: Master
Type: Thesis
Country: China
Candidate: S Q Yang
GTID: 2310330536979437
Subject: Statistics
Abstract/Summary:
A protein's properties are determined by its tertiary structure, and proteins need flexibility to carry out their diverse functions. Predicting the flexible regions of a protein from its sequence, that is, predicting structure from sequence information, is a challenging task. This research builds on the idea that "sequence determines structure", proposed by Anfinsen in the 1960s. Because sequencing is far faster than structure determination, protein structure prediction has become an important goal in biological research. At present, the main experimental approaches to determining tertiary structure are multidimensional nuclear magnetic resonance and X-ray diffraction, both of which are time-consuming and extremely difficult. The main alternative is computational: bioinformatics methods that use protein-related data to learn flexible regions from already-known protein structures.

Machine learning methods that predict structure from protein sequences usually involve high-dimensional features. The key is therefore to select the smallest number of features that still gives the highest prediction accuracy. The core principle of the selection is maximum relevance and minimum redundancy: the aim is to build a feature subset that is both highly relevant to the prediction target and minimally redundant. For this reason, feature selection has become one of the most active research topics in machine learning in recent years.

Definitions of protein flexible regions vary widely. One typical definition is based on the B-factor values obtained from X-ray diffraction data. The larger a residue's B-factor, the greater the uncertainty of its structure, so residues with large B-factors are defined as flexible and the remaining residues as rigid. Another definition is based on the structural discrepancy between sequences shared by multiple proteins: when the discrepancy is sharp, the residues are defined as flexible, and as rigid otherwise.

First, this research proposes a new method for predicting the flexible/rigid regions of proteins, FSID_FRP, built on a data set obtained by comparing the structural discrepancy of sequences shared by multiple proteins. The method uses a feature selection procedure based on the increment of diversity, called FSID, to select effective features. FSID is better suited to small-sample studies than entropy-based feature selection, and it proves effective for predicting flexible/rigid regions from protein sequences. Logistic regression is then applied to integrate the selected features into a scheme that classifies regions as flexible or rigid.

Second, to test its validity, FSID_FRP is applied to a data set of 1000 PDB sequence structures in which the flexible regions are defined by normalized B-factor values. The 1000 proteins are randomly divided into two groups of 500 proteins each, one serving as the training set and the other as the test set. FSID is applied to the training set to select the feature subset, and FSID_FRP is then used to make predictions and assess the method's validity.
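
The abstract names the increment of diversity as the basis of FSID but does not spell out the measure. The sketch below shows the commonly used definition, assuming count vectors X and Y over the same feature categories: D(X) = N log N - sum(n_i log n_i) with N = sum(n_i), and ID(X, Y) = D(X + Y) - D(X) - D(Y). The function names are hypothetical, and this is not the thesis's FSID selection procedure, which the abstract does not describe.

```python
import math


def diversity(counts):
    """Diversity measure D(X) = N*log(N) - sum(n_i*log(n_i)), taking 0*log(0) as 0."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(n * math.log(n) for n in counts if n > 0)


def increment_of_diversity(x, y):
    """ID(X, Y) = D(X + Y) - D(X) - D(Y) for two count vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("count vectors must have the same length")
    merged = [a + b for a, b in zip(x, y)]
    return diversity(merged) - diversity(x) - diversity(y)


# Hypothetical counts of one feature's categories in two classes.
flexible_counts = [4, 0]
rigid_counts = [0, 4]
print(increment_of_diversity(flexible_counts, rigid_counts))
```

Under this definition the value is zero when the two count vectors have the same proportions and grows as the distributions diverge, which is what makes it usable as a similarity score between a feature's distribution in flexible and in rigid residues.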
Keywords/Search Tags:Protein flexible regions, k-spaced amino acid pairs, Feature selection, Increment of diversity, Logistic regression
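
The k-spaced amino acid pairs listed in the keywords can be illustrated with a short sketch. It assumes the usual encoding, in which a k-spaced pair consists of the residues at positions i and i + k + 1, and the count of each of the 400 possible pairs is divided by the number of positions scanned. The function name, the default `k_max`, and the choice to skip non-standard residues are assumptions for illustration; the abstract does not state the settings used in the thesis.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kspaced_pair_composition(sequence, k_max=2):
    """Return {(k, pair): frequency} for every k in 0..k_max and all 400 pairs."""
    features = {}
    for k in range(k_max + 1):
        counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
        n_positions = len(sequence) - k - 1  # start positions with a partner residue
        for i in range(max(n_positions, 0)):
            pair = sequence[i] + sequence[i + k + 1]
            if pair in counts:  # skip pairs containing non-standard residues
                counts[pair] += 1
        for pair, n in counts.items():
            features[(k, pair)] = n / n_positions if n_positions > 0 else 0.0
    return features


feats = kspaced_pair_composition("ACDA", k_max=1)
print(feats[(0, "AC")], feats[(1, "AD")])
```

Each value of k contributes 400 features, so the feature space grows quickly with `k_max`, which is the kind of high-dimensional setting that motivates feature selection.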