Study On Virus Genetic Diversity And Host Prediction

Posted on: 2024-01-22 | Degree: Doctor | Type: Dissertation | Country: China | Candidate: C Y Lu | GTID: 1520307334978819 | Subject: Biology

Abstract/Summary:

With the widespread application of sequencing technology, more and more new viruses have been discovered. However, it is still unknown how many kinds of viruses exist on Earth, and most discovered viruses have not been accurately classified or annotated for their biological features, including the most important one, the host of the virus. Systematic studies of viral genetic diversity, together with tools that predict the classification and hosts of new viruses, can provide a foundation for exploring the biological characteristics, ecology, and transmission of viruses. This study systematically investigated viral diversity and host prediction using bioinformatics methods. The main research results are as follows:

1) The global space of viral genetic diversity was estimated for the first time. By simulating the process of virus discovery and accumulation in viral metagenomic studies, and fitting a power function to the growth trends of viral genomes and protein clusters, it was estimated that there are at least 890 million viral operational taxonomic units (vOTUs) and 1.8 billion viral protein clusters (vPCs) on Earth. However, only approximately 1-2% of vOTUs and vPCs have been discovered so far, and an additional 56 million samples would be needed to capture half of the global viral genetic diversity. Therefore, more virome sequencing projects are needed to explore viral genetic diversity. Using similar methods, this study analyzed and predicted the growth trends and total abundance of viruses in different ecosystems, and found that aquatic ecosystems contain more viral genetic diversity, with higher genetic diversity identified in each sample and faster growth in viral abundance. Therefore, aquatic ecosystems
should be prioritized in future virus discovery research. Additionally, this study analyzed the distribution of viruses at different taxonomic levels across ecosystems, as well as the viral composition of each ecosystem, and found that some viruses prefer certain ecosystems. Megaviricetes, for example, show a higher preference for aquatic ecosystems. The estimates of growth trends and total abundance of viral genetic diversity in different environments can help guide targeted sampling and sequencing in subsequent research.

2) The structural diversity of existing virus reference genomes was systematically investigated. The study focused on three aspects. First, the preference of transcriptional direction for different types of viral genes was analyzed based on the transcriptional direction of neighboring genes. RNA viruses showed higher consistency in transcriptional direction (mostly co-directional transcription), while DNA viruses exhibited more diverse transcriptional directions with a higher proportion of divergent and convergent transcription, consistent with the transcriptional direction preferences of their respective host types. Second, the relationship between the relative transcriptional direction of gene pairs and their biological functions was studied, revealing that gene pairs with divergent and co-directional transcription were more likely to encode promoters and had stronger protein functional correlations. Third, the relationship between gene distance on the genome and protein functional correlation was investigated. The degree of protein functional correlation (probability of protein-protein interaction and shortest distance in the protein-protein interaction network) decreased as gene distance (number of intervening genes) increased, as did the probability of forming structural complexes. This study systematically
analyzed the structural characteristics of viral genomes and the coding rules of genes, and explored the relationship between genome structure and protein function, providing a theoretical foundation for research on virus classification and evolution.

3) A Markov model-based classification method for prokaryotic virus families, called Prokaryotic virus Classification Predictor (PCP), is proposed. This method uses all protein sequences of each virus family to construct a Markov model, and calculates similarity scores between the protein composition of the target virus and the Markov model of each virus family. The target virus is then assigned to virus families hierarchically based on the ranking of similarity scores, with a statistical evaluation of confidence. In practical tests, PCP outperforms Protein Tag methods and classical BLAST-based methods overall, especially for novel and short virus genomes. This study thus provides a new and effective method for prokaryotic virus classification, which will improve virus classification in viromics research.

4) A novel method for predicting human viruses based on graph convolutional neural networks (GCN), called Human Virus Predictor (HVP), is proposed. This method constructs a virus similarity network using information from both virus proteomes and genomes, and trains a GCN model to predict the likelihood that a virus infects humans. Tests on the benchmark dataset show that HVP outperforms existing methods, with an AUC of 0.848 compared to 0.773 for existing methods. HVP's F1-score is also 0.096 higher than that of existing methods. On an independent test set that includes human-related viruses and viruses with unknown hosts, HVP's detection rate for human-related viruses is 10.5% higher than that of existing methods. These results indicate that HVP can more accurately identify viruses that infect humans and provides an effective method for the prevention and control of emerging viral infectious
diseases.

5) A novel method for predicting prokaryotic virus hosts based on a Gaussian model, called Prokaryotic virus Host Predictor (PHP), is proposed. This method uses a Gaussian model to fit the k-mer frequency differences between virus and prokaryotic genome sequences, and calculates a likelihood value for each potential host based on the k-mer frequency differences between the query viral genome and all potential host genomes. The potential host with the highest likelihood value is selected as the predicted host of the query viral genome. This study evaluated PHP using a rigorous clustering-based partitioning of training and testing data, and found that it outperformed two other sequence feature-based methods on both benchmark and testing datasets. Although sequence alignment-based methods have higher prediction accuracy than the three sequence feature-based methods, they are unable to make predictions for certain viruses. For these viruses, PHP performs significantly better than the other two sequence feature-based methods, indicating that PHP can effectively complement sequence alignment-based methods. On a dataset of virus-bacteria infection relationships identified by bacterial single-cell sequencing, PHP achieved an accuracy of 80% at the host family level, indicating that PHP can accurately predict virus hosts in practical research. Furthermore, this study developed local software and an online website for more convenient use by researchers, providing a fast and effective tool for predicting prokaryotic virus hosts in viromics research.

In summary, this study estimates the global space of viral genetic diversity for the first time, develops a classification method for prokaryotic viruses to support accurate and reasonable virus classification, and develops two new methods and tools for predicting virus hosts, one to identify human viruses and one to predict prokaryotic virus hosts. These studies not only deepen our
understanding of virus genetic diversity and virus hosts but also provide methodological guidance for subsequent studies on virus-host interactions and on the prevention and treatment of viral diseases, giving the work both theoretical and practical value.

Keywords/Search Tags: Virus diversity, Genome structure, Virus classification, Virus host prediction, Machine learning, Prokaryotic virus, Human virus
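
Result 1 rests on fitting a power function to accumulation curves of viral genomes and protein clusters. The sketch below is a minimal illustration of that idea, not the dissertation's code. It assumes the form y = a * x^b, fits it by ordinary least squares in log-log space, and uses made-up sample counts. The function names `fit_power_law`, `predict_clusters`, and `samples_needed` are hypothetical, and the abstract does not say how the global totals were derived from the fitted curve.

```python
import math


def fit_power_law(samples, clusters):
    """Fit clusters ~ a * samples**b by least squares on log-transformed data."""
    xs = [math.log(s) for s in samples]
    ys = [math.log(c) for c in clusters]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct sample sizes")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # exponent: slope of the log-log line
    a = math.exp(mean_y - b * mean_x)  # scale: exp of the log-log intercept
    return a, b


def predict_clusters(a, b, samples):
    """Extrapolate the fitted curve to a given number of samples."""
    return a * samples ** b


def samples_needed(a, b, target_clusters):
    """Invert the fitted curve: samples required to reach a target count."""
    return (target_clusters / a) ** (1.0 / b)


# Synthetic accumulation curve (assumed data, generated from a=100, b=0.5).
toy_samples = [1, 4, 16, 64, 256]
toy_clusters = [100, 200, 400, 800, 1600]

a, b = fit_power_law(toy_samples, toy_clusters)
print(f"a = {a:.3f}, b = {b:.3f}")
print("clusters expected at 1024 samples:", predict_clusters(a, b, 1024))
```

An exponent b below 1 means each new sample contributes fewer new clusters than the previous one, which is the kind of trend that lets sampling effort be compared across ecosystems.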
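
Result 3 describes scoring a target virus against one Markov model per virus family. A minimal sketch of that scoring step follows, under stated assumptions: a first-order Markov chain over the 20 standard amino acids with add-one smoothing, and the mean log-probability per transition as the similarity score. The abstract does not give the model order, the smoothing, or the score definition, and the hierarchical assignment and confidence evaluation it mentions are not reproduced here. All function names and the toy protein sequences are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def train_markov(proteins, pseudocount=1.0):
    """Build a first-order transition model (log-probabilities) from proteins."""
    counts = {a: {b: pseudocount for b in AMINO_ACIDS} for a in AMINO_ACIDS}
    for seq in proteins:
        for prev, cur in zip(seq, seq[1:]):
            if prev in counts and cur in counts[prev]:  # skip non-standard residues
                counts[prev][cur] += 1
    model = {}
    for prev, row in counts.items():
        total = sum(row.values())
        model[prev] = {cur: math.log(c / total) for cur, c in row.items()}
    return model


def similarity(proteins, model):
    """Mean log-probability per transition of the query proteins under a model."""
    total, n = 0.0, 0
    for seq in proteins:
        for prev, cur in zip(seq, seq[1:]):
            if prev in model and cur in model[prev]:
                total += model[prev][cur]
                n += 1
    return total / n if n else float("-inf")


def rank_families(proteins, family_models):
    """Return (family, score) pairs sorted from most to least similar."""
    scored = [(similarity(proteins, m), fam) for fam, m in family_models.items()]
    return [(fam, s) for s, fam in sorted(scored, reverse=True)]


# Toy training data (assumed): two made-up families with distinct composition.
family_models = {
    "FamA": train_markov(["MKKLLKKLLKK", "MKLLKKLK"]),
    "FamB": train_markov(["MGSGSGSGSG", "MSGSGGSS"]),
}
print(rank_families(["MKKLLKLL"], family_models))
```

Normalizing by the number of transitions keeps scores comparable between long and short query genomes, which matters because the abstract highlights performance on short genomes.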
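
Result 5 describes fitting a Gaussian model to k-mer frequency differences between viruses and their hosts, then choosing the candidate host with the highest likelihood. The sketch below illustrates that pipeline in a simplified form. It assumes k = 2 for readability, an independent Gaussian per k-mer dimension fitted only on known virus-host pairs, and a variance floor to avoid division by zero. None of these choices is stated in the abstract, so the released PHP software may differ. The function names and toy sequences are hypothetical.

```python
import math
from itertools import product

K = 2  # assumed k-mer length for this toy example; the abstract does not state k
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def kmer_freqs(seq, k=K):
    """Normalized k-mer frequencies of seq, in the fixed order of KMERS."""
    counts = dict.fromkeys(KMERS, 0)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # ignore k-mers with ambiguous bases
            counts[kmer] += 1
    total = sum(counts.values()) or 1
    return [counts[m] / total for m in KMERS]


def freq_diff(virus, host):
    """Per-k-mer frequency difference between a virus and a host genome."""
    return [v - h for v, h in zip(kmer_freqs(virus), kmer_freqs(host))]


def fit_gaussian(pairs, min_var=1e-4):
    """Fit per-dimension mean and variance on known (virus, host) pairs."""
    cols = list(zip(*(freq_diff(v, h) for v, h in pairs)))
    means = [sum(col) / len(col) for col in cols]
    variances = [
        max(sum((x - m) ** 2 for x in col) / len(col), min_var)
        for col, m in zip(cols, means)
    ]
    return means, variances


def log_likelihood(diff, model):
    """Log-likelihood of a difference vector under the independent Gaussians."""
    means, variances = model
    return sum(
        -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
        for x, mu, var in zip(diff, means, variances)
    )


def predict_host(virus, hosts, model):
    """Score every candidate host and return the best one with all scores."""
    scores = {
        name: log_likelihood(freq_diff(virus, genome), model)
        for name, genome in hosts.items()
    }
    return max(scores, key=scores.get), scores


# Toy data (assumed): viruses whose composition resembles their host's.
known_pairs = [
    ("ATATATATAT", "TATATATATA"),
    ("TATATATATA", "ATATATATAT"),
    ("GCGCGCGCGC", "CGCGCGCGCG"),
    ("CGCGCGCGCG", "GCGCGCGCGC"),
]
model = fit_gaussian(known_pairs)
candidates = {"at_rich": "TATATATATA", "gc_rich": "GCGCGCGCGC"}
print(predict_host("ATATATAT", candidates, model))
```

Because the method needs only sequence composition, it can still return a prediction when no alignment to a host genome exists, which matches the abstract's point that PHP complements alignment-based methods.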