Font Size: a A A

Assembly And Analysis Of High Throughput Sequencing Data

Posted on:2022-06-12Degree:DoctorType:Dissertation
Country:ChinaCandidate:J ZhaoFull Text:PDF
GTID:1480306608979979Subject:Computer Software and Application of Computer
Abstract/Summary:PDF Full Text Request
Contigs assembly,i.e.,reconstructing the initial contigs by connecting reads produced by sequencing,is one of the most basic and challenging problems of life science,and its result affects a series of studies such as function analysis.The rapid development of high-throughput sequencing provides opportunities for the study of contigs assembly,but it also brings some unprecedented challenges in computing.High-throughput sequencing technology can generate a great deal of reads,and the length of a read is usually much shorter than the length of an original contig.The key problem of contigs assembly is how to reconstruct the initial contigs through these massive sequencing data.Alternative splicing is typical in the expression of most multi-exon genes.Abnormal alternative splicing is one of the major causes of many diseases including cancer.Assembling and analyzing the transcriptome is thus crucial for studying these diseases and transcriptomics.Transcriptome refers to the set of all the mRNAs after transcription.RNA high-throughput sequencing technology can sequence these mRNAs and generate a great deal of reads,ranging from millions to hundreds of millions.In theory,we can reconstruct all the sequenced transcripts and accurately estimate their expression levels.However,this task is complicated by highly similar paralogs,various alternative splicing variants,different expression levels of isoforms of the same gene,and sequencing errors and bias.The accuracy of the existing transcriptome assembly algorithms still cannot meet the real requirements,even on relatively simple simulated dataset.Therefore,studying the problem of transcriptome assembly based on high-throughput sequencing data and designing effective algorithms are difficult while hot in the field of computational biology and bioinformatics.Many viruses such as Ebola virus,Zika virus,HIV,and Corona virus have shown to have many variants(strains).These variants usually have different toxicity,drug resistance,and pathogenesis,which reminds that strain-level study is must for virus research.Viral quasispecies assembly,i.e.,reconstructing all the strains of the same virus through the sequencing data is an important mean for strain-level study of viruses.For the problem of transcriptome assembly,we developed de novo assembly algorithms IsoTree and DTA-SiST,proposed a general transcripts extraction strategy MultiTrans,and designed an algorithm for transcript expression levels estimation.For the problem of viral quasispecies assembly,we proposed a new de novo assembly algorithm,called LG-Strain.The main contributions of this thesis are as follows:1.The existing transcriptome assembly algorithms usually consist of two main steps,one is to construct a graph to represent alternative splicing,and the other is to extract paths to represent the transcripts.In the process of constructing graphs through reads,most algorithms first split the reads to some short sequences(k-mers),and then construct graphs based on the overlaps between these k-mers.The above graph construction method is called k-mer-based strategy.The k-mer-based strategy can only guarantee k-1-length overlap,but ignores the longer overlapping information between reads.We believed that the longer the overlap between two reads,the more likely they are from the same transcript.In order to make full use of the overlapping information between reads,we developed a read-based splicing graph construction method.This method first selects the thickest unused read as the seed sequence and extends it by continuously selecting reads that maintain the longest overlap with the termininate of this seed sequence.Then,this method makes branches extension and constructs graphs,and then it selects a new read as seed and repeats the above steps until all the reads are used.After the construction of splicing graphs,this method revises the graphs according to types of alternative splicing.In order to extract paths that represent transcripts,we proposed a graph model,called isoform tree.We converted each splicing graph into an isoform tree based on the thickness of the vertices and edges of the splicing graph.In the isoform tree,each path from the root node to a leaf node represents a transcript candidate,and the weight of the leaf node corresponds to the expression value of this candidate.We call the above de novo transcriptome assembly algorithm IsoTree.Experimental results show that IsoTree performs well on both simulated and real datasets,but it takes more time to construct splicing graphs.2.In order to solve the time-consuming problem of IsoTree,we proposed another de novo transcriptome assembly algorithm,called DTA-SiST.The main idea of DTA-SiST is to extend contigs based on suffix trees.DTA-SiST can obtain the read with the longest overlap with the terminate of current contig in 0(L)time,where L represents the length of a read.The space complexity of the suffix tree constructed by this method is 0(N'(L-l)),where N'?N,N represents the number of reads,and l is the minimum overlapping length.In the step of extracting transcript paths,considering that the thicker path is more likely to be a transcript,we proposed a hybrid strategy based on dynamic programming to extract the longest s?t path among the thickest paths.The time consumed by this strategy is 0((V+E)E),where V and E represent the number of nodes and edges,respectively.In order to detect as many transcripts as possible,we developed a depth-first enumeration strategy to recover all the possible transcripts that can be represented by the splicing graph.This algorithm has a time bound of O((V+E)P),where P is the number of paths starting from s.Experimental results show that these two path extraction strategies have their own advantages.3.In this thesis,we proposed a general transcripts extraction algorithm based on mixed integer linear programming,called MultiTrans.According to the existing research experience,comprehensively considering the graph structure,paired-end sequencing information,sequencing depth,and the number of paths,MultiTrans models the transcripts extraction problem as extracting a set of paths with respect to the following goals:firstly,each vertex or edge is covered by at least one path,each paired-end read is covered by at least one path,the coverage of each vertex is as close as possible to the sum of flows of paths that pass through this vertex,and the number of paths is as small as possible.For this optimization problem,we used mixed integer linear programming to solve it.MultiTrans can extract more transcript paths in both splicing graphs and assembly graphs,thanks to its comprehensive consideration of graph structure,paired-end sequencing information,and sequencing depth.In addition,MultiTrans has very high accuracy under effect of the objective of extracting as few transcript candidates as possible.We tested MultiTrans on both simulated dataset and real dataset of human,mice,and rice.The experimental results are consistent with expectations,and MultiTrans can recover more transcripts with higher precision.4.The existing algorithms for de novo assembly of virus strains mostly follow the idea of transcriptome assembly,and the lengths of the assembled strains are usually very different.This is inconsistent with the phenomenon that strains of the same virus have similar lengths.Based on this,we designed a graph model specifically for viral quasispecies assembly with considering the differences between virus strains,called level graph.We first used the reads to construct a backbone as a reference,and then re-aligned the reads to the backbone and calculated the SNVs(Single Nucleotide Variants).Subsequently,we counted the SNV combination types in each dynamic window and deleted the combination types that occur rarely.We regarded each SNV combination as a vertex,and the frequency of the combination as the weight of the vertex.Now,each strain can be represented by a path of length LG,where LG is the number of levels in the level graph.In order to simplify the level graph,we used paired-end sequencing information to locally decompose the level graph,and we modeled the local decomposition problem as an optimization problem.For the optimization problem,we developed a mixed integer linear programming model to solve it.We then started from the level with the largest number of vertices and iteratively merged its adjacent layers until only one layer remains.Note that each vertex in the last level is exactly one strain candidate.After obtaining the candidate sequences,we aligned the reads to these sequences again and updated the sequences according to the alignment results.Finally,we tested our algorithm on the dataset of Zika virus,HIV,polio virus,and hepatitis C virus.Experimental results show that the backbone constructed by our algorithm is highly credible,and it is sometimes even a complete strain sequence.In addition,our algorithm is also comparable in terms of reconstructing the strain sequences.The distribution of contig lengths obtained by LG-Strain is usually closer to the distribution of real strains.5.Although many transcriptome assembly algorithms also calculate the expression value of each transcript,the accuracy of the expression value is usually not very high.Researchers usually take the estimation of transcript expression level as a separate issue for research.For the problem of estimating transcript expression level,we developed PSOISO algorithm.The main idea of this algorithm is to first analyze and calculate the probability that a read comes from a certain exon with assuming the transcript expression levels,and optimizes these values to make the estimated number of reads belonging to each exon is as close as possible to the actual number.In the process of initializing the expression values of the transcripts,PSOISO tried assigning them to random values as well as the expression values calculated by transcriptome assemblers.The experimental results show that the accuracy of the PSOISO is higher than some assemblers in the case of randomizing the initial values.In addition,PSOISO can effectively improve the accuracy over the transcript expression values of Cufflinks,StringTie,Tigar2,and rSeq.
Keywords/Search Tags:High-throughput Sequencing, Contigs Assembly, Alternative Splicing, Viral quasispecies
PDF Full Text Request
Related items