Studies On Transcriptome Assembly Algorithm Based On Data Fusion

Posted on:2023-05-16

Degree:Doctor

Type:Dissertation

Country:China

Candidate:X Y Zhao

Full Text:PDF

GTID:1520306614483254

Subject:Applied Mathematics

Abstract/Summary:

With the rapid development of life science and technology in the era of big data, the relevant data are growing exponentially. Progress in biology increasingly depends on the collection, storage, retrieval, analysis, and mining of such data. Bioinformatics, a new interdisciplinary science, has emerged in response and is rapidly becoming one of the frontiers and core areas of the life and natural sciences. It starts from data and uses mathematics, computer science, statistics, and cybernetics to mine the various types of information contained in biological data. Transcriptomics is a crucial branch of bioinformatics. It explores the transcription of genes in cells at the RNA level, for example the extent of mRNA splicing and the regulation of gene expression by non-coding RNA. Transcriptome assembly is the prerequisite for subsequent transcriptomic analysis: it supports downstream studies such as transcriptome regulation, differential gene expression analysis, co-expression analysis, and enrichment analysis, and therefore plays a foundational role in transcriptomics.

Mainstream transcriptome assembly algorithms follow one of two strategies: de novo assembly and reference-based (genome-guided) assembly. In recent years, researchers have developed a large number of assembly algorithms, which have greatly contributed to the development of transcriptomics. However, our tests show that current algorithms still have major limitations, and the recall and accuracy of mainstream algorithms remain low. With the advancement of sequencing technology, more and more data are available. Fusing the available information to develop accurate and efficient transcriptome assembly algorithms remains a fascinating and challenging research topic, and it is the main focus of this dissertation.

Currently, the specific problems in transcriptome assembly include: 1) how to fully utilize the known information to construct a more accurate splicing graph that represents gene expression; 2) how to accurately resolve uncertainty in alternative splicing events at both ends of exons, which is the core problem of transcriptome assembly. In addition, almost all transcriptomics studies involve RNA sequencing of multiple samples. For downstream quantification and differential expression analysis, it is necessary to create a consistent transcriptome assembly across multiple RNA-Seq samples. This is the problem of transcript reconstruction from multiple RNA-Seq data sets. How to fuse the information of multiple samples into the same splicing graph while still reflecting the unique information of each individual sample is also an urgent problem.

To address these issues, this dissertation thoroughly investigated the shortcomings of existing algorithms, examined the available valid information, and applied knowledge from graph theory, combinatorial optimization, and related fields. From the perspective of fusing the mapping results of different aligners, we designed the Tiglon algorithm and introduced the Labeled Splicing Graph model. From the perspective of fusing the information of different samples, we designed the TransMeta algorithm and introduced the Vector Weighted Splicing Graph model. Each algorithm introduces a new graph model together with its own assembly procedure; both can effectively reconstruct full-length transcripts and overcome, to a certain extent, the defects of existing algorithms. This research is based on next-generation (high-throughput) sequencing technology. The main results are as follows.

1. The Tiglon algorithm: The first step of the genome-guided assembly strategy is mapping the RNA-Seq data. This step relies on aligners, and a single aligner has strong data preferences due to its particular design. We found that a single aligner often fails to capture all alternative splicing events in a gene, which directly affects the subsequent assembly results. To address this problem, we developed the Tiglon algorithm based on aligner fusion. Tiglon can take full advantage of the mapping information generated by different aligners. The main innovations include: 1) The first proposal to fuse the mapping results of multiple aligners as input, which reduces the data preference of individual tools and makes the input information more accurate. 2) The first proposal of a new graph model, called the Labeled Splicing Graph. A label is added to each edge weight in the Labeled Splicing Graph. The labels show how many aligners recognize the reads supporting each edge, which distinguishes the information sources and helps capture more correct splice junctions. 3) A label-based dynamic path search algorithm that uses the label weights of the Labeled Splicing Graph to calculate a confidence level from information such as the number of aligners by which the reads are identified. On this basis, the graph is selectively extended to find path covers that represent transcripts. The performance of Tiglon was compared with other commonly used tools along multiple dimensions, such as the number of correctly reconstructed transcripts, precision, and F-Score, on one simulated dataset and 50 real datasets. The results show that Tiglon performs substantially better than other assembly tools, whether the input is the mapping result of HISAT2 alone, of STAR alone, or the fused results of both. In particular, Tiglon shows a significant improvement in reconstructing low-expression transcripts, which is a harder problem in transcriptome assembly and an important indicator of the quality of an algorithm.

2. The TransMeta algorithm: Almost all transcriptomics studies involve multiple samples. To advance multi-sample assembly, we developed the TransMeta algorithm based on multi-sample data fusion. The main innovations include: 1) The concept of the Vector Weighted Splicing Graph (VWSG) is proposed for the first time, which differs from the weight assignment commonly used in previous algorithms. A VWSG weights each of its edges and nodes with a vector; the element at the k-th position of the vector is the corresponding weight in sample k. This keeps the sequencing information of each sample intact, avoids information loss, and preserves what the samples share while retaining their differences. 2) For the first time, we use cosine similarity to characterize the relationship between weighted edges and nodes. We use the included angle between the vector weights of adjacent edges, rather than the norm, to calculate similarity and resolve the repeat problem. That is, we only focus on the similarity of the information of the same read in different samples. This approach is better suited to the characteristics of multi-sample data and facilitates more accurate path extension. 3) A newly designed label-based path search algorithm is used to reconstruct the transcriptome. TransMeta uses a transcript selection algorithm to generate a multi-sample transcriptome. Based on this, the data errors of individual samples are corrected, and a transcriptome is output for each sample. TransMeta was extensively tested against the best available tools using several evaluation criteria, such as recall, precision, precision-recall curve, and F-Score. We tested all tools on 25 simulated RNA-Seq samples, 5 large real human RNA-Seq data sets (189 samples in total), and two small data sets (9 samples in total). At the multi-sample assembly level, TransMeta achieves the best precision-recall curve over a wide range of coverage thresholds, outperforming PsiCLASS, StringTie2 (as well as its merge mode), Scallop, and TACO. At the individual sample level, TransMeta also consistently achieves significantly higher recall and higher or comparable precision.

Tiglon and TransMeta are both open-source software implemented in C++. They can be downloaded from:
Tiglon: https://github.com/yutingsdu/Tiglon-v.1.1.git
TransMeta: https://sourceforge.net/projects/transassembly/files/TransMeta/
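
The two graph models described in the abstract can be illustrated with a small Python sketch. It is not taken from the Tiglon or TransMeta source code (both are C++), and every name in it (fuse_junctions, cosine_similarity, best_extension) is hypothetical. It assumes that splice junctions are given as (donor, acceptor) coordinate pairs with read counts, that the fused edge weight is the largest count reported by any aligner, and that a path is extended along the outgoing edge whose per-sample weight vector makes the smallest angle with the incoming edge. The abstract states none of these details, so treat them as illustrative choices only.

```python
from math import sqrt


def fuse_junctions(per_aligner):
    """Labeled-edge sketch: merge junction counts from several aligners.

    per_aligner maps an aligner name to {(donor, acceptor): read_count}.
    Returns {(donor, acceptor): (weight, label)} where label is the number
    of aligners that reported the junction.
    ASSUMPTION: weight = max count over aligners (avoids double counting
    reads that every aligner maps); the real tool may weight differently.
    """
    fused = {}
    for junctions in per_aligner.values():
        for junction, reads in junctions.items():
            if reads <= 0:
                continue
            weight, label = fused.get(junction, (0, 0))
            fused[junction] = (max(weight, reads), label + 1)
    return fused


def cosine_similarity(u, v):
    """Cosine of the angle between two per-sample weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an edge with no support in any sample matches nothing
    return dot / (norm_u * norm_v)


def best_extension(in_edge, out_edges):
    """Vector-weighted sketch: pick the outgoing edge most similar to in_edge.

    in_edge is a list whose k-th element is the edge weight in sample k;
    out_edges maps a (hypothetical) edge name to such a list.
    ASSUMPTION: greedy choice by cosine similarity, ignoring vector norm.
    """
    return max(out_edges, key=lambda name: cosine_similarity(in_edge, out_edges[name]))


if __name__ == "__main__":
    fused = fuse_junctions({
        "HISAT2": {(100, 200): 8, (300, 400): 2},
        "STAR": {(100, 200): 6},
    })
    print(fused)

    incoming = [10, 0, 5]                       # weights in samples 1..3
    outgoing = {"a": [2, 0, 1], "b": [0, 300, 0]}
    print(best_extension(incoming, outgoing))
```

In the second example, edge "b" has a much larger norm than edge "a", but its support comes from a sample in which the incoming edge has none. Comparing angles rather than norms therefore selects "a", which is the behaviour the abstract attributes to the cosine-similarity criterion.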

Keywords/Search Tags:

Data Fusion, Transcriptome Assembly, Labeled Splicing Graph, Vector Weighted Splicing Graph, Multiple RNA-Seq Samples Assembly, RNA-Seq

Related items

1	Studies Of Transcriptome Assembly Algorithm Based On Multi-strategy Fusion
2	Algorithm Studies Of Transcriptome Assembly Based On High Throughput RNA-seq Data
3	De Novo Transcriptome Assembly From RNA-seq
4	Algorithm Studies Of Transcriptome Assembly Based On Next Generation RNA-seq Data
5	Study On The Characters Involved In Splicing Mechanism Of Eukaryotic Genes
6	Computational studies with ESTs: Assembly, SNP detection, and applications in alternative splicing
7	Research On De Bruijn Graph For DNA Sequence Assembly
8	Optimization On Genomic Big Data Assembly
9	Comparison Of Transcriptome Assembly Software For Next-Generation Sequencing Technologies
10	Research On Haplotype Assembly Algorithm Based On Graph