
Genome-guided Transcriptome Assembly

Posted on: 2023-06-11 | Degree: Master | Type: Thesis
Country: China | Candidate: C C Li | Full Text: PDF
GTID: 2530306614980359 | Subject: Computer Science and Technology
Abstract/Summary:
In recent years, thanks to the rapid development of high-throughput sequencing technologies such as RNA-seq, mRNA sequencing data has taken a qualitative leap and the cost of sequencing has dropped significantly, opening up more possibilities for transcriptome research such as transcriptome reconstruction. However, because the reads generated by high-throughput technologies such as RNA-seq are relatively short, the full-length sequence of an mRNA, or transcript, cannot be detected directly. This gives rise to the problem of transcriptome assembly. The subject of this thesis is genome-guided transcriptome assembly: with the assistance of a reference genome, using massive RNA-seq data to reconstruct all transcripts that appear during gene expression and to estimate their abundances.

Alternative splicing during transcription in eukaryotes greatly enriches the diversity of gene functions, and aberrant alternative splicing is one of the important causes of diseases such as cancer. The study of the transcriptome is therefore of great significance in both biology and medicine. However, the uncertainty in exon combinations introduced by alternative splicing poses great challenges. Although assembly algorithms are constantly being updated, they are still far from achieving ideal results, and research on new assembly algorithms is still urgently needed.

Genome-guided transcriptome assembly first aligns the sequenced short reads to the reference genome to obtain alignment information, then constructs a graph representation from this information, such as a splicing graph, and finally designs an assembly algorithm on this graph and implements it as software. By analyzing and comparing the advantages and disadvantages of currently popular assembly algorithms, we find that almost all of them tend to confine path extension to local parts of the graph. We therefore propose TransCoord, a new algorithm for genome-guided transcriptome assembly that coordinates the assembly process as a whole. It first constructs a set of candidate paths from the splicing graph using two different path extension strategies, and then feeds this set into a two-phase linear programming model. The model minimizes edge coverage differences, so that the predicted result set fits the splicing graph as closely as possible, and at the same time assembles as few predicted transcripts as possible while ensuring that the necessary conditions are met. In this way, the assembly result is the overall outcome of coordinating the selection among all candidate transcripts, rather than a simple merging of separately assembled parts, which is the innovation of this algorithm. Compared with 4 state-of-the-art assembly algorithms on 19 Homo sapiens and 5 Arabidopsis thaliana real datasets, TransCoord assembled the most correct transcripts and obtained the highest recall, showing a clear competitive advantage.

This thesis also introduces TransLayer, another new genome-guided transcriptome assembly algorithm that we designed and implemented. It is based on our newly designed extended graph, performs hierarchical transcript extension over different combinations of start and end points, and uses the maximum flow method to estimate abundance. The extended graph and the hierarchical extension are its main innovations. On 19 human datasets, TransLayer surpasses 2 current state-of-the-art assembly algorithms in recall. TransCoord has been published, and the corresponding software is freely available at: https://github.com/lcc121/TransCoord.
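The ideas of a splicing graph, candidate paths, and edge coverage difference described above can be illustrated with a small sketch. This is not the TransCoord implementation: the toy graph, the node names (`s`, `e1`, `t`, and so on), the coverage values, and the function names are all hypothetical, and the sketch only evaluates the coverage difference for a given abundance assignment rather than solving a linear program.

```python
from collections import defaultdict

# Hypothetical toy splicing graph: nodes are exons (plus source "s" and sink "t"),
# edges are splice junctions annotated with read coverage. Values are illustrative.
EDGE_COVERAGE = {
    ("s", "e1"): 30.0,
    ("e1", "e2"): 20.0,
    ("e1", "e3"): 10.0,
    ("e2", "e3"): 20.0,
    ("e3", "t"): 30.0,
}


def build_adjacency(edge_coverage):
    """Build an adjacency list from the edge set of the (acyclic) graph."""
    adj = defaultdict(list)
    for u, v in edge_coverage:
        adj[u].append(v)
    return adj


def enumerate_paths(adj, source, sink):
    """Enumerate all source-to-sink paths; each path is a candidate transcript."""
    paths = []
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == sink:
            paths.append(path)
            continue
        for nxt in adj[node]:
            stack.append((nxt, path + [nxt]))
    return sorted(paths)


def coverage_difference(edge_coverage, paths, abundances):
    """Sum over edges of |observed coverage - coverage explained by the paths|."""
    explained = defaultdict(float)
    for path, abundance in zip(paths, abundances):
        for edge in zip(path, path[1:]):
            explained[edge] += abundance
    return sum(abs(cov - explained[edge]) for edge, cov in edge_coverage.items())


adj = build_adjacency(EDGE_COVERAGE)
candidates = enumerate_paths(adj, "s", "t")
for path in candidates:
    print(" -> ".join(path))
print(coverage_difference(EDGE_COVERAGE, candidates, [20.0, 10.0]))
```

In this toy case, assigning abundances 20 and 10 to the two candidate paths explains every edge exactly, while assigning all coverage to a single path leaves a nonzero difference. A model of the kind the abstract describes would search over such assignments globally instead of checking them one at a time.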
Keywords/Search Tags:RNA-seq, transcriptome assembly, genome-guided, linear programming