Transcriptome Expression Analysis For High-throughput Full Length Data

Posted on: 2020-02-20
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Qu
Full Text: PDF
GTID: 2370330590972670
Subject: Computer Science and Technology
Abstract/Summary:
Calculating transcriptome expression levels is an important means of studying gene function, but alternative splicing in eukaryotes makes it difficult to calculate the expression levels of gene isoforms accurately. Third-generation sequencing, which has emerged in recent years, is a new experimental method for transcriptome research characterized by its ability to produce ultra-long reads. It compensates for the shortcomings of second-generation sequencing, whose reads are too short and make isoform detection difficult. PacBio's ISO-seq technology, designed for the transcriptome, brings new opportunities for transcriptome research, especially for the detection of novel isoforms.

However, little work so far has applied ISO-seq data to the calculation of isoform expression levels. Some existing work combines ISO-seq with RNA-seq data to calculate expression levels. Most of these methods use only the full-length reads, which are a minority of all experimental data, and ignore much of the useful information in the discarded non-full-length reads, so the data are not fully utilized and data throughput is low. In addition, methods that mix ISO-seq and RNA-seq data draw on the advantages of both sequencing technologies, but their computational complexity is higher and running two sequencing technologies is costly.

To address these problems, this thesis proposes two models, DSIDP and MCIDP, both built on retaining non-full-length reads. The proposed methods use only ISO-seq data to predict isoform structure and to calculate isoform usage. The specific work completed in this thesis is as follows:

1) Because the existing preprocessing framework does not meet this thesis's requirement of retaining non-full-length reads, a set of data preprocessing methods that preserve both full-length and non-full-length reads is proposed first. Starting from the ISO-seq raw data, four steps (raw data processing, read error correction, read mapping and exon sequence sorting) produce the input data for the proposed models.

2) To calculate the expression levels of isoforms that have full-length reads, the DSIDP model is proposed. It builds an isoform prediction set from the full-length reads and calculates isoform expression ratios using both full-length and non-full-length reads. DSIDP maps all reads to the isoform prediction set and uses Dirichlet sampling to solve the multi-mapping problem. The model is validated on both simulated and real data.

3) To detect ultra-long isoforms that have no full-length reads, the MCIDP model is proposed. It uses a Markov chain to model the random process of alternative splicing among a gene's exons. MCIDP not only builds an isoform prediction set from the full-length reads, but also predicts extra-long isoforms for which no full-length read was measured in the data, which contributes significantly to the detection of novel isoforms. The model is validated on both simulated and real data.
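The abstract does not spell out how Dirichlet sampling resolves multi-mapping reads, so the following is only a minimal sketch of one common approach, a Gibbs-style sampler, and not the DSIDP model itself. It assumes each read comes with a list of compatible isoform indices, treats full-length and non-full-length reads alike, and uses a symmetric prior; the names `sample_dirichlet`, `estimate_isoform_usage`, `compat` and `prior` are hypothetical.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def estimate_isoform_usage(compat, n_isoforms, n_iter=500, burn_in=100,
                           prior=1.0, seed=0):
    """Estimate isoform usage ratios from reads that may map to several isoforms.

    compat: one list per read holding the isoform indices the read is
            compatible with (hypothetical input format).
    Returns the posterior-mean usage ratio of each isoform.
    """
    rng = random.Random(seed)
    theta = [1.0 / n_isoforms] * n_isoforms   # current usage estimate
    totals = [0.0] * n_isoforms               # running sum after burn-in

    for it in range(n_iter):
        # Step 1: assign every read to one compatible isoform,
        # with probability proportional to the current usage estimate.
        counts = [0] * n_isoforms
        for isoforms in compat:
            weights = [theta[i] for i in isoforms]
            chosen = rng.choices(isoforms, weights=weights, k=1)[0]
            counts[chosen] += 1

        # Step 2: resample usage from the Dirichlet posterior given the counts.
        theta = sample_dirichlet([prior + c for c in counts], rng)

        if it >= burn_in:
            for i in range(n_isoforms):
                totals[i] += theta[i]

    kept = n_iter - burn_in
    return [t / kept for t in totals]


# Toy data: 30 reads unique to isoform 0, 10 unique to isoform 1,
# and 20 reads compatible with both (the multi-mapping case).
compat = [[0]] * 30 + [[1]] * 10 + [[0, 1]] * 20
usage = estimate_isoform_usage(compat, n_isoforms=2)
print(usage)
```

In this toy setup the ambiguous reads are shared out in proportion to the evidence from uniquely mapped reads, so isoform 0 ends up with the larger estimated usage. A real model would also need to account for read length and mapping position, which this sketch ignores.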
Keywords/Search Tags: PacBio, ISO-seq, transcriptome expression, third-generation sequencing, novel isoform detection, multi-mapping, Dirichlet sampling, Markov chain