Graphical Representation And Feature Extraction Of Protein Sequences

Posted on: 2019-09-03
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z C Mu
GTID: 1368330542997005
Subject: Operational Research and Cybernetics
Abstract/Summary:
With the rapid development of sequencing techniques, the number of protein sequences in biological databases is increasing at an explosive rate. The many newly sequenced proteins create an urgent need for novel computational algorithms that compare them with sequences from known protein families and predict their structures and functions. Given the shortcomings of traditional sequence alignment algorithms, developing alignment-free algorithms for similarity analysis of protein sequences has become an active topic in bioinformatics. As an alignment-free method, graphical representation of protein sequences has received great attention because it transforms biological sequences into visual zigzag curves and yields efficient numerical descriptors. This dissertation focuses on graphical representation and feature extraction methods for protein sequences. The main contributions are as follows:

(1) Based on a five-letter model of the 20 amino acids, we propose a new 3-D graphical representation of protein sequences. In this method, we first map the five representative letters and their pairs to points on the base of a right cone by using two mappings. Then we transform the protein sequence into a five-letter sequence and map its letters to points in 3-D space through an iterative function. Connecting adjacent points gives the graphical curve of the protein sequence. The numerical features of the protein sequence are extracted from the leading eigenvalues of the L/L matrix corresponding to the graphical curve. The innovation of this method is that it integrates the accumulative frequencies of adjacent amino acids in the protein sequence into the graphical representation. Experiments on two protein datasets show that the proposed method is effective.

(2) Based on 158 physicochemical properties of amino acids selected from the AAindex database, we propose a novel strategy for the graphical representation of protein sequences. For each property, the 20 amino acids are arranged on the circumference of the right cone according to their values of that property. Following the procedure in (1), we obtain the graphical curves and numerical features of the protein sequence. Because each of the 158 properties gives a different arrangement, a protein sequence corresponds to 158 graphical curves of different structures, which capture more of the information contained in the sequence. Because the resulting feature vectors are high-dimensional, PCA is used to reduce the dimension of the feature matrix, and the reduced vectors are taken as the feature vectors of the protein sequences for similarity analysis. Experiments on four protein datasets demonstrate the effectiveness of the proposed method.

(3) A novel feature extraction method for protein sequences is proposed based on the traditional CGR curve. After the traditional CGR curve is obtained, the unit circle is divided into four segments according to the four quadrants. In each segment, the pairwise distances between the points on the CGR curve are calculated, and the leading eigenvalues of the distance matrices corresponding to the four segments are taken as the numerical features of the CGR curve. Compared with traditional feature extraction methods, this method takes the distribution of points within each segment into account and can describe the CGR curve in more detail. In addition, we adopt the strategy in (2) and arrange the 20 amino acids on the circumference of the unit circle according to the same 158 physicochemical properties selected from the AAindex database. PCA is again used to reduce the dimension of the feature matrix. Experiments on five protein datasets demonstrate the effectiveness of the proposed method.
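The quadrant-based feature extraction in (3) can be illustrated with a minimal Python sketch. It is not the dissertation's implementation, and several details are assumptions because the abstract does not specify them: the CGR starts at the circle's center and moves halfway toward the current amino acid's vertex at each step, the 20 amino acids are placed at equal angles in a fixed alphabetical order (the dissertation instead orders them by a physicochemical property), and only the k largest eigenvalues per quadrant are kept. The names `cgr_points` and `quadrant_features` are hypothetical.

```python
import numpy as np

# Assumed placeholder ordering; the dissertation orders the amino acids
# by each of 158 AAindex properties, which is not reproduced here.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def cgr_points(seq, order=AMINO_ACIDS):
    """Return the CGR points of `seq` as an (n, 2) array.

    Assumption: amino acids sit at equal angles on the unit circle and
    each step moves halfway from the current point to the next vertex.
    """
    angles = 2 * np.pi * np.arange(len(order)) / len(order)
    vertices = {aa: np.array([np.cos(t), np.sin(t)])
                for aa, t in zip(order, angles)}
    point = np.zeros(2)  # assumed starting point: center of the circle
    points = []
    for aa in seq:
        point = (point + vertices[aa]) / 2.0
        points.append(point)
    return np.array(points)


def quadrant_features(points, k=1):
    """Return the k leading eigenvalues of each quadrant's distance matrix.

    The output has length 4 * k. Quadrants with fewer than two points
    contribute zeros (an assumed convention).
    """
    x, y = points[:, 0], points[:, 1]
    masks = [
        (x >= 0) & (y >= 0),  # quadrant I
        (x < 0) & (y >= 0),   # quadrant II
        (x < 0) & (y < 0),    # quadrant III
        (x >= 0) & (y < 0),   # quadrant IV
    ]
    features = []
    for mask in masks:
        segment = points[mask]
        top = np.zeros(k)
        if len(segment) >= 2:
            # Pairwise Euclidean distance matrix of the points in this quadrant.
            diff = segment[:, None, :] - segment[None, :, :]
            dist = np.sqrt((diff ** 2).sum(axis=-1))
            # eigvalsh returns ascending eigenvalues of a symmetric matrix.
            eig = np.linalg.eigvalsh(dist)[::-1]
            n = min(k, len(eig))
            top[:n] = eig[:n]
        features.extend(top)
    return np.array(features)
```

Calling `quadrant_features(cgr_points(seq), k)` gives one feature vector per sequence for one amino acid ordering. Under the strategy in (2), this would be repeated for each of the 158 orderings and the stacked features passed to PCA before comparing sequences.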
Keywords/Search Tags: Protein sequence, Graphical representation, Feature extraction, Similarity analysis