
Comparison And Analysis Of Nucleotide Sequences Based On Alignment-free K-mer Counting

Posted on: 2020-09-12
Degree: Master
Type: Thesis
Country: China
Candidate: L Fu
Full Text: PDF
GTID: 2370330572982245
Subject: Systems Engineering
Abstract/Summary:
As the main carrier of genetic information, the nucleotide sequence is the foundation for expressing the genetic characteristics of a species. Genomic analysis not only yields functional information about the genome, such as regulation and genetic variation, but also clarifies the phylogenetic relationships among species. Metagenomic analysis reveals the species and functional diversity of a microbial community, which in turn makes it possible to explore the relationship between the community and its environment or host. Nucleotide sequence comparison is therefore of great importance for both genomic and metagenomic analysis.

With the development of high-throughput sequencing technology, sequencing data have become more accurate and comprehensive, but they also bring large data volumes and fragmented information. How to compare genomic and metagenomic sequences from high-throughput sequencing data has thus become an important research topic. Traditional approaches to sequence comparison obtain species information through sequence alignment or assembly. However, they are limited by the reference gene library, and both assembly and alignment are time-consuming. To identify group-specific sequences between two groups of sequencing samples, this work proposes an alignment-free and assembly-free computational framework based on k-mer counting. For genome comparison, it proposes an alignment-free and assembly-free method that measures the relationship between genomes and quantifies the importance of each k-mer in depicting that relationship.

We propose a model called MetaGO, which uses long k-mers (≥30 bp) as features. A feature that is present, or abundant, in one group but absent, or scarce, in the other is considered a "group-specific" feature. To improve computing efficiency, we deployed MetaGO on Apache Spark for parallel computing. MetaGO was applied to one simulated and three real disease-related metagenomic datasets. We also used the predictive ability of models built on the group-specific features to verify the discriminative power of those features. Experimental results show that MetaGO accurately recovers the sequences that were set to differ between groups in the simulated dataset. Moreover, on the real datasets, classifiers built on the group-specific features identified by MetaGO show considerable improvement in classification performance over previous studies. This indicates that MetaGO effectively captures the differences between groups, which is of great significance for further understanding microbial communities or other similar types of sequences.

Furthermore, we propose a model based on a Siamese neural network to evaluate k-mer feature importance, in which short k-mers (<10 bp) are used as features. The Siamese network maps each of a pair of genomic k-mer count vectors to a low-dimensional space. The network is then trained by minimizing a loss function, defined as the sum of squared errors between the distance of the two genomes in the low-dimensional space and their ANI value, which serves as the genetic standard. Experimental results on the genome sequences of 28 vertebrates demonstrate that species close to each other in the phylogenetic tree have similar key k-mer features, indicating that the k-mer feature importance quantified by the proposed model reflects genome similarity.
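
The idea of a group-specific k-mer feature described in the abstract can be illustrated with a minimal Python sketch. This is not the MetaGO implementation. The function names, the presence-fraction thresholds (`min_present`, `max_present_other`), and the toy value `k = 4` are assumptions made for illustration only; the abstract specifies long k-mers and does not state how presence or abundance is tested.

```python
from collections import Counter

def count_kmers(seq, k):
    """Count overlapping k-mers in one sequence, skipping windows with non-ACGT symbols."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts

def sample_profile(reads, k):
    """Sum k-mer counts over all reads of one sequencing sample."""
    total = Counter()
    for read in reads:
        total.update(count_kmers(read, k))
    return total

def presence_fraction(kmer, samples):
    """Fraction of samples (non-empty list of Counters) in which the k-mer occurs."""
    return sum(1 for s in samples if s[kmer] > 0) / len(samples)

def group_specific_kmers(group_a, group_b, min_present=0.8, max_present_other=0.2):
    """Label k-mers that are common in one group and rare in the other.

    The thresholds are hypothetical; they only stand in for the
    "present or rich in one group, absent or scarce in the other" criterion.
    """
    all_kmers = set().union(*group_a, *group_b)
    labels = {}
    for kmer in all_kmers:
        fa = presence_fraction(kmer, group_a)
        fb = presence_fraction(kmer, group_b)
        if fa >= min_present and fb <= max_present_other:
            labels[kmer] = "A"
        elif fb >= min_present and fa <= max_present_other:
            labels[kmer] = "B"
    return labels

if __name__ == "__main__":
    k = 4  # toy value; real features would use much longer k-mers
    group_a = [sample_profile(["ACGTACGT"], k), sample_profile(["TTACGTAA"], k)]
    group_b = [sample_profile(["GGGGCCCC"], k), sample_profile(["CCCCGGGG"], k)]
    print(sorted(group_specific_kmers(group_a, group_b, 1.0, 0.0).items()))
```

Because each sample is reduced to a k-mer count table independently of every other sample, the counting step is the kind of workload that can be distributed across workers, which is consistent with the abstract's choice of Apache Spark for parallel computing.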
Keywords/Search Tags: Nucleotide sequence comparison, k-mer counting, Genomics, Metagenomics, Siamese neural network