
Research On Virus Evolution And Classification Based On Statistical Features Of Genetic Sequences

Posted on: 2021-09-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L L He
GTID: 1480306542496844
Subject: Statistics
Abstract/Summary:
Viruses are biological entities that are widely distributed in nature. Although structurally simple, they can have a huge impact on human activities. With the rapid development of sequencing technology, researchers have determined the gene and protein sequences of many kinds of viruses. Using sequence analysis to study the classification, origin, and evolution of viruses is a key step toward understanding their functions, and it has very important practical value for the prevention and diagnosis of viral infectious diseases and for vaccine research and development. Traditional multiple sequence alignment (MSA) methods are effective for sequence analysis and, in general, can accurately reconstruct evolutionary relationships. However, the mutation rate of virus sequences is often very high, especially for single-stranded RNA viruses, which leads to inaccurate alignment results. MSA is also very time-consuming. These drawbacks make MSA unsuitable for classification and evolutionary analysis of large-scale datasets and long viral sequences. Based on the statistical characteristics of biological sequences, this thesis proposes three fast, alignment-free methods for virus classification and evolutionary analysis.

First, based on three important physicochemical properties of amino acids and their distribution along the sequence, we propose a 24-dimensional feature vector for protein sequences. Using this vector, we are able to classify and analyze virus proteins. Results on multiple virus datasets show that the new tool can quickly and accurately classify viral proteins and infer viral phylogeny.

Second, HIV-1 is the most common and pathogenic strain of human immunodeficiency virus and consists of many subtypes. To study the differences among HIV-1 subtypes in infection, diagnosis, and drug design, it is important to identify the subtypes of clinical HIV-1 samples. In this work, we propose an effective numeric representation called the Subsequence Natural Vector (SNV) to encode HIV-1 sequences. SNV is based on the distribution of nucleotides in HIV-1 viral sequences: it not only counts the nucleotides but also describes their positions and the variance of those positions. Using this representation, we introduce an improved linear discriminant analysis method to classify HIV-1 viruses. To validate our alignment-free method, 6902 complete genomes and 11,668 pol gene sequences of HIV-1 subtypes were collected from the up-to-date Los Alamos HIV database. The results show that SNV outperforms three popular methods, Kameris, Comet, and REGA, and achieves almost 100% sensitivity and specificity. In addition, our method takes much less time. The SNV method can also correctly construct the phylogenetic tree of HIV-1.

Finally, starting from the statistical characteristics of gene sequences, we propose a fast alignment-free method for virus classification named the Positional Correlation Natural Vector (PCNV). This new vector includes the average positions of nucleotides and the covariance of nucleotides, and it converts a DNA sequence into an 18-dimensional vector. Results on multiple virus datasets show that PCNV is fast and accurate for inferring the phylogeny of organisms. Compared with alignment-based Bayesian inference and two alignment-free methods, the PCNV method has advantages in both accuracy and speed.
Keywords/Search Tags: virus classification, biological evolution, SNV, PCNV, feature vector
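
The abstract describes vectors built from nucleotide counts, positions, and the variance of positions, but it does not give the exact definitions of SNV or PCNV. As a rough illustration only, the sketch below computes the classic 12-dimensional natural vector for a DNA sequence (count, mean position, and normalized second central moment for each of A, C, G, T), which is assumed here to be the kind of construction those methods extend. The function names `natural_vector` and `euclidean_distance` are hypothetical, and the normalization by n_k * n follows the usual natural vector convention rather than anything stated in the abstract.

```python
from math import sqrt

ALPHABET = "ACGT"  # assumption: plain DNA alphabet, other symbols are ignored


def natural_vector(seq):
    """Return a 12-dimensional vector for a DNA sequence.

    Layout: 4 counts, then 4 mean positions, then 4 normalized
    second central moments, each in the order A, C, G, T.
    This is the classic natural vector, not the thesis's SNV or PCNV.
    """
    seq = seq.upper()
    n = len(seq)
    counts, means, moments = [], [], []
    for base in ALPHABET:
        # 1-indexed positions at which this nucleotide occurs
        positions = [i for i, ch in enumerate(seq, start=1) if ch == base]
        n_k = len(positions)
        if n_k == 0:
            # Absent nucleotide: all three features are set to zero
            counts.append(0.0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(positions) / n_k
        # Second central moment, normalized by n_k * n
        d2_k = sum((p - mu_k) ** 2 for p in positions) / (n_k * n)
        counts.append(float(n_k))
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments


def euclidean_distance(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    v1 = natural_vector("AATT")
    v2 = natural_vector("ACGT")
    print(v1)
    print(euclidean_distance(v1, v2))
```

Because every sequence maps to a vector of fixed length regardless of how long it is, pairwise distances can be computed without aligning the sequences, which is where the speed advantage of alignment-free methods comes from. The distances could then feed a classifier or a tree-building method; the choice of Euclidean distance here is also an assumption.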