
Algorithm Studies Of Transcriptome Assembly Based On High Throughput RNA-seq Data

Posted on: 2018-10-25
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J T Liu
Full Text: PDF
GTID: 1310330512981452
Subject: Operational Research and Cybernetics
Abstract/Summary:
With the rapid development of modern biology, a large amount of biological data is generated every day. This provides an unprecedented opportunity for the study and development of biology. However, it is highly challenging, and often impossible, for traditional biological techniques to process such huge volumes of data efficiently. Bioinformatics, a rapidly rising interdisciplinary field, combines techniques from mathematics, computer science and statistics to solve problems in biology, especially those involving huge amounts of biological data. One of the most basic and most challenging problems in bioinformatics is sequence assembly, of which transcriptome assembly is among the most important instances. The aim of transcriptome assembly is to reconstruct all expressed transcripts in an experimental transcriptome and, at the same time, to accurately estimate their expression levels from the massive sequencing reads in RNA-seq data. In this study, we mainly focus on how to apply classic combinatorial optimization strategies to transcriptome assembly, which will benefit the study of new species and of human diseases related to alternative splicing, especially cancer.

With the rapid development of next-generation sequencing, RNA-seq is becoming an essential and powerful tool for transcriptome analysis. However, interpreting the generated RNA-seq data brings new computational challenges. Computational strategies for transcriptome assembly generally fall into two categories: 1) genome-guided and 2) de novo methods. When a high-quality reference genome exists, genome-guided approaches usually start by mapping the RNA-seq reads to the reference genome and then assemble transcripts for each gene based on the mapping results. De novo approaches assemble transcripts directly from the reads without using any reference information; they are the only choice when the reference genome is unavailable, incomplete, highly fragmented, or substantially altered, as in cancer tissues. For both genome-guided and de novo assembly, existing computational methods face a bottleneck in both effectiveness and efficiency, which severely limits their practical application. It is therefore imperative to develop novel, high-quality algorithms for both strategies in order to accurately recover full-length transcriptomes in eukaryotic species.

Based on these considerations, this study proposes a novel genome-guided transcriptome assembly algorithm, TransComb, which is built on completely new design ideas and resolves the current bottleneck to a great extent. Tested on both simulated and multiple real datasets, TransComb demonstrates significant improvements in both recall and precision, and greatly reduces the problem of high false-positive rates. In addition, a careful comparison of computational resource consumption shows that TransComb runs much faster and requires considerably less memory on average than the other methods. We therefore conclude that TransComb performs much better than the other algorithms in both accuracy and computational efficiency. The improvement of TransComb comes mainly from the following advantages:

1) Novel techniques for the accurate construction of splicing graphs. For example, TransComb uses paired-end reads to repair exons that are fragmented because of the low expression levels of their genes, and it corrects exons wrongly merged because of sequencing or mapping errors by sliding a window along the reference genome.
2) A solution to the key difficulty in transcriptome assembly via the newly designed combing strategy and the use of paired-end information. Resolving the ambiguities in linking in- and out-splicing junctions at each exon with multiple splicing junctions is the toughest task in developing an assembler. Most existing assemblers suffer from this difficulty, which results in low assembly accuracy. The newly designed combing strategy integrates the coverage and paired-end information and resolves this key difficulty to a great extent.
3) A newly developed graph model: the weighted junction graph. Rather than working from the splicing graph as other methods do, TransComb assembles the expressed transcripts from the junction graph, which contains more useful information, and it thereby overcomes many disadvantages of the existing methods.
4) A newly designed path extension strategy based on the junction graph. In each extension, the new strategy extends the current path to the neighboring node that is supported by credible information from the edge weights in the junction graph, so each predicted path has a very high probability of representing an expressed transcript, whether its expression level is low or high.

Although TransComb demonstrates significant advantages, it still has the following shortcomings:

1) The current version of TransComb is not parallelized, so its implementation needs further optimization.
2) For estimating the expression levels of the assembled transcripts, the current version of TransComb does not take sequencing bias into account, so on some datasets it performs only comparably to other leading estimators. Its expression level estimation therefore still needs improvement.

Finally, another new transcriptome assembly method, BinPacker, is briefly introduced; it is our newly developed de novo transcriptome assembler. BinPacker remodels the problem as tracking a set of trajectories of items whose sizes represent the coverage of their corresponding isoforms. This approach, which integrates the coverage information into the procedure, has two distinctive features: 1) only splicing junctions are involved in the assembly procedure; 2) the massive, disordered reads are assembled as if by moving a comb along the junction edges of a splicing graph. Tested on both simulated and real datasets, BinPacker performs much better than almost all existing de novo assemblers, including the most widely used one, Trinity. On some datasets, it even outperforms some genome-guided assemblers, such as the well-known StringTie. In addition, it runs substantially faster and requires less memory than most of the compared assemblers.

TransComb and BinPacker are implemented in C++ and are freely available from: http://sourceforge.net/projects/transcriptomeassembly/files/.
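
To make the junction-graph idea in the abstract more concrete, the following is a minimal sketch in Python. It is not TransComb's actual implementation (which is in C++ and is not described in detail here). It assumes a simple greedy rule, a hypothetical input format in which each link between two junctions carries a count of supporting reads or read pairs, and hypothetical function names (`build_junction_graph`, `extend_path`, `junctions_to_exons`). Each node is a splicing junction, two junctions are linked when the first ends at the exon where the second begins, and a path is extended to the best-supported neighbor.

```python
from collections import defaultdict


def build_junction_graph(junctions, link_support):
    """Build a weighted junction graph (illustrative format, not TransComb's).

    junctions:    iterable of (exon_u, exon_v) pairs, one per splicing junction.
    link_support: dict mapping (junction_1, junction_2) to the number of reads
                  or read pairs assumed to span both junctions.
    Returns a dict: junction -> {successor junction: support weight}.
    """
    junctions = list(junctions)
    by_start = defaultdict(list)
    for u, v in junctions:
        by_start[u].append((u, v))

    graph = {}
    for junction in junctions:
        end_exon = junction[1]
        # A successor is any junction that starts at the exon where this one ends.
        graph[junction] = {
            nxt: link_support.get((junction, nxt), 0)
            for nxt in by_start[end_exon]
        }
    return graph


def extend_path(graph, seed, min_support=1):
    """Greedily extend a path from `seed` along the best-supported links."""
    path = [seed]
    visited = {seed}
    current = seed
    while True:
        candidates = {
            nxt: weight
            for nxt, weight in graph[current].items()
            if weight >= min_support and nxt not in visited
        }
        if not candidates:
            return path
        current = max(candidates, key=candidates.get)
        path.append(current)
        visited.add(current)


def junctions_to_exons(path):
    """Convert a path of junctions back into the exon chain it represents."""
    return [path[0][0]] + [v for _, v in path]


# Toy gene: exon C has two incoming and two outgoing junctions, so the
# splicing graph alone cannot tell which in-junction pairs with which out-junction.
toy_junctions = [("A", "C"), ("B", "C"), ("C", "D"), ("C", "E")]
toy_support = {
    (("A", "C"), ("C", "D")): 25,
    (("A", "C"), ("C", "E")): 2,
    (("B", "C"), ("C", "D")): 1,
    (("B", "C"), ("C", "E")): 9,
}
toy_graph = build_junction_graph(toy_junctions, toy_support)
print(junctions_to_exons(extend_path(toy_graph, ("A", "C"))))
print(junctions_to_exons(extend_path(toy_graph, ("B", "C"))))
```

On this toy input, the link weights pair junction A-C with C-D and junction B-C with C-E, so the two printed exon chains are A, C, D and B, C, E. A real assembler would also have to handle backward extension, shared junctions, and expression-dependent thresholds, none of which this sketch attempts.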
Keywords/Search Tags:Bioinformatics, Alternative splicing, Next generation sequencing, Transcriptome assembly, Bin packing model