
The Alignment-free Methods Of Protein Sequences And Their Applications

Posted on: 2021-01-25
Degree: Master
Type: Thesis
Country: China
Candidate: Y X Zhao
Full Text: PDF
GTID: 2370330620473134
Subject: Applied Mathematics
Abstract/Summary:
Sequence similarity analysis is an important topic in biology. In recent years, the number of biological sequences in databases has grown rapidly, and interpreting and extracting useful information from these sequences has become particularly important. There are two general types of sequence analysis methods: alignment-based and alignment-free. Traditional alignment methods have several problems: they are memory-intensive and time-consuming, which makes large-scale data difficult to process; their accuracy is low for sequences with high mutation rates, frequent recombination of elements, horizontal gene transfer, gene duplication, or gene deletion; and exact multiple sequence alignment is an NP-hard problem, for which the available methods have not been unified. To address these problems, a number of alignment-free methods have been proposed over the past two decades. Because protein sequences are more complex than DNA sequences, relatively few alignment-free methods exist for them. This thesis focuses on alignment-free methods for protein sequences. Based on the physicochemical properties of amino acids, two new methods for analyzing protein sequences are proposed:

1. Based on six typical physicochemical properties of the 20 amino acids (the hydropathy index, the polar requirement, the chemical composition of the side chain, the isoelectric point, the average mass, and the van der Waals volume), the amino acids are classified and re-encoded with 17 symbols. Taking into account the frequency, average position, and position variance of each symbol, we describe the amino acid composition of a protein sequence with a 51-dimensional numerical vector. The standardized Euclidean distance between these vectors is then used as the similarity distance between protein sequences, and discriminant analysis and phylogenetic analysis are performed on that basis. Phylogenetic trees were built for the influenza A virus data set of 33 species, the ND5 data set of 9 species, and the ND6 data set of 8 species. The resulting trees agree with the known relationships, which supports the effectiveness and feasibility of the method.

2. We normalized the values of the same six physicochemical properties (hydropathy index, polar requirement, chemical composition of the side chain, isoelectric point, average mass, and van der Waals volume). From the normalized data we defined a new index, the average value of physicochemical properties (Apv), which combines all six properties into a single value for each amino acid. For a given protein sequence, we draw a 2D curve from the position and the Apv of each amino acid; this construction avoids self-intersection and folding of the curve. We then use the cumulative distance to calculate the similarity distance between protein sequences. Based on this new 2D curve and the cumulative distance, we analyzed the similarities and differences between protein sequences both qualitatively and quantitatively. Finally, the method is demonstrated on the ND6 data set of 8 species and the influenza A virus data set of 15 species: we plotted the 2D curves of the protein sequences in the ND6 data set and calculated their distance matrix, and we constructed a phylogenetic tree for the influenza A virus data set. The results show that the method can be applied easily and effectively to the comparison of protein sequences.
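As a rough illustration of the first method, the following Python sketch builds a 51-dimensional vector (frequency, mean position, and position variance for each of 17 symbols) and computes pairwise standardized Euclidean distances. Several details are assumptions, since the abstract does not give them: the grouping of the 20 amino acids into 17 symbols (`MERGED` is a placeholder, not the thesis's classification), the use of relative frequency and population variance, 1-based positions, and the names `feature_vector` and `standardized_euclidean`.

```python
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Placeholder grouping (NOT taken from the thesis): merging three pairs
# reduces the 20 amino acids to 17 symbols.
MERGED = {"L": "I", "R": "K", "E": "D"}
SYMBOLS = sorted({MERGED.get(a, a) for a in AMINO_ACIDS})  # 17 symbols


def feature_vector(seq):
    """Return frequency, mean position and position variance per symbol."""
    positions = {s: [] for s in SYMBOLS}
    for i, aa in enumerate(seq, start=1):      # 1-based positions (assumed)
        sym = MERGED.get(aa, aa)
        if sym in positions:                   # ignore non-standard residues
            positions[sym].append(i)
    length = len(seq) or 1                     # avoid division by zero
    vec = []
    for s in SYMBOLS:
        pos = positions[s]
        n = len(pos)
        mean = sum(pos) / n if n else 0.0
        var = sum((p - mean) ** 2 for p in pos) / n if n else 0.0
        vec.extend([n / length, mean, var])
    return vec                                 # 17 * 3 = 51 components


def standardized_euclidean(vectors):
    """Pairwise distances, each component scaled by its variance over the data set."""
    m, dim = len(vectors), len(vectors[0])
    means = [sum(v[k] for v in vectors) / m for k in range(dim)]
    variances = [sum((v[k] - means[k]) ** 2 for v in vectors) / m
                 for k in range(dim)]

    def dist(x, y):
        # Components that are constant across the data set carry no signal
        # and are skipped to avoid dividing by zero.
        return sqrt(sum((x[k] - y[k]) ** 2 / variances[k]
                        for k in range(dim) if variances[k] > 0))

    return [[dist(x, y) for y in vectors] for x in vectors]


# Toy sequences, invented for illustration only.
seqs = ["MKTAYIAKQR", "MKTAYLAKQR", "GGSSGGSSPW"]
vectors = [feature_vector(s) for s in seqs]
D = standardized_euclidean(vectors)
```

The first two toy sequences differ only by an I/L substitution, which the placeholder grouping treats as the same symbol, so their distance is zero. A distance matrix such as `D` could then be passed to a tree-building method to obtain a phylogenetic tree.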
Keywords/Search Tags: protein sequences, physicochemical properties, alignment-free methods, phylogenetic tree, graphical representation