
Application Of Statistics Based On K-mer In Biological Sequence Analysis

Posted on: 2021-05-15  Degree: Master  Type: Thesis
Country: China  Candidate: G D Huang  Full Text: PDF
GTID: 2370330611966812  Subject: Theoretical Physics
Abstract/Summary:
As a supplement to and development of traditional alignment methods, k-mer-based alignment-free statistical algorithms for biological sequence analysis have gradually become an active field of bioinformatics research. An alignment-free statistical algorithm treats a DNA or protein sequence as a string and forms k-mers (words of length k) from it. The relatedness of sequences is then revealed by comparing k-mer frequencies across sequences. The accuracy and computational speed of k-mer statistics in biological sequence analysis have become a research focus, and remain a difficult problem.

We first studied the statistical power of k-mer statistics. K-mer statistics have low time and space complexity and are particularly suitable for comparative genomics. Many k-mer-based alignment-free statistics exist. The statistics D2S and D2* perform well in the search for cis-regulatory modules but poorly in the search for horizontal gene transfer sites. The improved statistics TsumS and Tsum*, which are based on D2S and D2*, were found to be very powerful in the search for horizontal gene transfer. We therefore further improved the Tsum model, using two parameters, coverage and fragment length, to adjust the statistical model. An effective range was found, which expands the application scope of TsumS and Tsum* and further reveals their statistical power. Such statistics, calculated from word patterns, place low requirements on sequence integrity and can provide a new perspective for genome comparison, which is of guiding significance for the processing of NGS data.

We then studied the dissimilarity measures d2S and d2star, which are based on D2S and D2*, and their application in phylogenetic analysis. We downloaded 100 16S rRNA gene sequences from the Silva database, calculated dissimilarity matrices with d2S and d2star, and built phylogenetic trees with the UPGMA method for different values of k. We then calculated the symmetric difference between each tree and the golden tree with the "treedist" tool of the Phylip software package. Both d2S and d2star performed best at k=8: the resulting tree had the highest similarity to the golden tree, with an acceptable symmetric difference and an exact clustering result, and it separated gene sequences at different levels (domain, phylum, class, family, genus).

K-mer-based alignment-free sequence comparison includes the classical Euclidean distance (Eu), Manhattan distance (Ma) and Chebyshev distance (Ch), as well as the dissimilarity measures Hao, d2, d2S and d2star. Their values range from 0 to 1. To promote alignment-free statistical methods in the study of evolutionary relationships, we developed a software tool, SeqDistK, that implements these seven distance and dissimilarity measures. SeqDistK runs on Windows, Linux and Mac systems. We compared the calculation speed of SeqDistK with that of three commonly used alignment programs, ClustalW2, Muscle and MAFFT, and confirmed that SeqDistK is considerably faster: its time complexity is much lower, which can greatly reduce the time cost of sequence analysis. SeqDistK thus provides a new tool for bioinformatics and a new way to apply alignment-free sequence statistics.
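
The k-mer counting step and the simpler distances named above can be illustrated with a short Python sketch. This is not the SeqDistK implementation: the function names are hypothetical, the three classical distances are computed on relative k-mer frequencies without any further rescaling, and the d2 formula shown is one commonly cited form (half of one minus the cosine of the two count vectors), which is assumed here rather than taken from the thesis. The background-corrected measures d2S and d2star also need a model of expected word counts and are left out.

```python
from collections import Counter
from itertools import product
import math


def kmer_frequencies(seq, k, alphabet="ACGT"):
    """Return relative k-mer frequencies in a fixed (lexicographic) word order.

    Hypothetical helper: words containing symbols outside `alphabet` are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[w] for w in words)
    if total == 0:
        raise ValueError("sequence contains no valid k-mers")
    return [counts[w] / total for w in words]


def euclidean(x, y):
    """Euclidean distance between two frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Manhattan (L1) distance between two frequency vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


def chebyshev(x, y):
    """Chebyshev (maximum) distance between two frequency vectors."""
    return max(abs(a - b) for a, b in zip(x, y))


def d2_dissimilarity(x, y):
    """Assumed form of d2: 0.5 * (1 - cosine similarity of the vectors)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 0.5 * (1.0 - dot / norm)


if __name__ == "__main__":
    # Toy sequences, chosen only to exercise the functions.
    fx = kmer_frequencies("ACGTACGTAC", 2)
    fy = kmer_frequencies("ACGTTTGTAC", 2)
    print(euclidean(fx, fy), manhattan(fx, fy),
          chebyshev(fx, fy), d2_dissimilarity(fx, fy))
```

Computing any of these functions for every pair of sequences yields a dissimilarity matrix of the kind that a clustering method such as UPGMA takes as input.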
Keywords/Search Tags:alignment-free, k-mer, statistical power, dissimilarity, 16S rRNA