Algorithms Of Aligning The Third-Generation Sequencing Sequences And Picking The Operational Taxonomic Units

Posted on: 2020-10-09
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z G Wei
GTID: 1480306740471914
Subject: Control theory and control engineering
Abstract/Summary:
With the development of high-throughput sequencing technology, a huge number of biological sequences have been generated, and mining these sequences in depth has become a major challenge for microbiologists. Metagenomics based on sequencing does not depend on cultivating individual species, as traditional microbial studies do. It studies the genomic composition and community function of microbes from DNA or RNA sequences by directly extracting genetic material from environmental samples. Aligning the single-molecule sequencing (SMS) sequences generated by third-generation sequencing technology, and clustering 16S rRNA sequences into operational taxonomic units (OTUs), are the foundation for processing large-scale sequence datasets when analyzing microbial composition, diversity and function. Developing better algorithms for aligning SMS sequences and picking OTUs will help explore the microbial world hidden in the sequence data. Starting from biological sequences, this dissertation focuses on SMS sequence alignment and OTU picking. The main contributions are as follows:

1. To thoroughly capture the features of PacBio sequencing data, several datasets sequenced on PacBio instruments were used to analyze the distributions of sequence length and sequencing errors, and the relationship between error rate and quality values. A new PacBio sequencing simulator (called NPBSS) with an empirical model was then developed. NPBSS first generates read lengths according to a log-normal distribution and chooses different base quality values with different proportions; it then computes the overall error probability of each base in the read with an empirical model; finally, it derives the deletion, substitution and insertion probabilities from the overall error probability to assign the different sequencing errors. Experimental results show that NPBSS fits the error rate of PacBio reads better than other simulators, and that its simulated reads are more similar to real PacBio read data. The NPBSS simulator can provide effective benchmark datasets for evaluating the reliability of sequence alignment algorithms for SMS data.

2. To improve on the aligned-region coverage ratio and read alignment quality of most existing SMS mapping tools, a novel mapping method for SMS sequences (called smsMap) was introduced, which locates the alignment starting positions on a reference genome. smsMap first constructs an index of the reference genome using the BWT-FM index technique; it then identifies the starting positions in the read and the reference genome with a location strategy that computes the credibility of each starting position; finally, a banded alignment approach with the low column matrix is used to obtain the alignment results. Compared with existing methods on five SMS datasets, smsMap is more sensitive: it aligns more sequences and bases and obtains a higher aligned coverage ratio. smsMap is also more robust to sequencing errors.

3. To reduce the sensitivity to sequencing errors of most heuristic OTU clustering methods, a heuristic clustering method (called DBH) was proposed that introduces the de Bruijn (DB) graph for seed selection. First, a series of temporary clusters is formed according to the similarity threshold; then, a DB graph is built for each cluster to generate a new seed that represents the cluster; finally, the remaining sequences are assigned to the corresponding OTUs according to their distance to the new seeds. The experimental results show that DBH is more robust to sequencing errors and reduces the overestimation of the number of OTUs. DBH also handles large-scale datasets effectively, with low computational complexity.

4. Most existing heuristic clustering methods rely heavily on a single seed sequence for each cluster, and one seed cannot completely represent the cluster, which results in low clustering accuracy and quality. To address this issue, a novel dynamic multi-seed clustering method (called DMSC) was designed. DMSC selects multi-core sequences (MCS) as the seeds of a cluster instead of a single sequence, and dynamically updates the MCS whenever a new sequence is added to the cluster. Selecting and updating the MCS in this way ensures that the MCS better represent the cluster and yields better clustering results. Comparisons with other methods on five datasets demonstrate that DMSC achieves higher clustering accuracy and better clustering quality.
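
The multi-seed idea behind DMSC can be illustrated with a small toy sketch in Python. This is not the dissertation's implementation: the abstract does not say how the multi-core sequences are chosen or how similarity is measured, so the sketch assumes a similarity of 1 minus the normalized edit distance, scores a read against a cluster by its mean similarity to the current seeds, and re-picks the seeds as the most central unique members after every addition. The names `Cluster`, `cluster_multi_seed`, `threshold` and `max_seeds` are hypothetical.

```python
from typing import List, Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed similarity measure: 1 - edit distance / longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


class Cluster:
    """A cluster represented by several seed sequences (hypothetical design)."""

    def __init__(self, first: str, max_seeds: int) -> None:
        self.members: List[str] = [first]
        self.seeds: List[str] = [first]
        self.max_seeds = max_seeds

    def score(self, seq: str) -> float:
        # Mean similarity of the candidate to all current seeds.
        return sum(similarity(seq, s) for s in self.seeds) / len(self.seeds)

    def add(self, seq: str) -> None:
        self.members.append(seq)

        def centrality(m: str) -> float:
            return sum(similarity(m, o) for o in self.members) / len(self.members)

        # Assumption: seeds are the unique members most similar, on average,
        # to the whole cluster. This is quadratic and only suits toy inputs.
        unique = list(dict.fromkeys(self.members))
        self.seeds = sorted(unique, key=centrality, reverse=True)[: self.max_seeds]


def cluster_multi_seed(
    seqs: List[str], threshold: float = 0.97, max_seeds: int = 3
) -> List[Cluster]:
    """Greedy clustering: join the best-scoring cluster or open a new one."""
    clusters: List[Cluster] = []
    for seq in seqs:
        best: Optional[Cluster] = None
        best_score = 0.0
        for c in clusters:
            s = c.score(seq)
            if s >= threshold and s > best_score:
                best, best_score = c, s
        if best is None:
            clusters.append(Cluster(seq, max_seeds))
        else:
            best.add(seq)
    return clusters


if __name__ == "__main__":
    reads = ["ACGTACGTAC", "ACGTACGTAT", "ACGTACGTAC", "TTTTGGGGCC", "TTTTGGGGCA"]
    for k, c in enumerate(cluster_multi_seed(reads, threshold=0.85, max_seeds=2)):
        print(k, len(c.members), c.seeds)
```

Scoring against several seeds rather than one means that a single error-laden seed cannot decide cluster membership on its own, which is the motivation the abstract gives for DMSC. A practical tool would replace the quadratic seed update and the full edit distance with faster approximations.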
Keywords/Search Tags: Alignment of third-generation sequencing data, OTUs, Clustering, High-throughput sequencing, Metagenomics