Research On Sequence Alignment Algorithm Based On High-throughput Transcriptome Sequencing

Posted on: 2017-02-14
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zhang
GTID: 2348330491959933
Subject: Computer software and theory
Abstract/Summary:
In recent years, next-generation sequencing (NGS) technologies have developed rapidly and are producing ever larger volumes of sequencing data. How to process these data is a central topic in bioinformatics. Applied to the transcriptome, NGS is known as high-throughput transcriptome sequencing (RNA-seq). One important function of software that analyzes RNA-seq data is to reconstruct how RNA is spliced in cells; it should also be able to assess the expression level of every isoform. The first step of any such analysis, however, is to align the RNA-seq reads to the corresponding reference. Because introns are removed when pre-mRNA is spliced into mature mRNA, RNA-seq alignment differs inherently from traditional sequence alignment: two parts of one read may have to be aligned to different exons. Algorithms designed specifically for RNA-seq alignment are therefore needed. Nearly all existing RNA-seq alignment algorithms depend on the canonical signals of splice sites, yet many splice sites with non-canonical signals have important biological functions. For example, the GT-TG splice site is related to the human adenylyl cyclase stimulatory G-protein Gαs. We therefore introduce two new RNA-seq alignment algorithms that aim to identify a variety of splice sites.

(1) The Algorithm Independent of Canonical Signals of Splice Sites

This RNA-seq alignment algorithm, named RNAMap, adopts an extension strategy over overlapping seeds. Because the seeds overlap, the alignment information of the seeds is guaranteed to yield alignments of the reads. While scanning the genome, RNAMap builds a static table and a dynamic table to index the seeds and their alignment information.
RNAMap then tries to identify splice sites between left anchors and right anchors without being restricted to canonical signals. In a computational experiment on reads spanning a variety of splice sites, RNAMap reached a recall of 92.53% and a precision of 97.01%, and it performed better than other RNA-seq alignment tools.

(2) The Improvement of the Algorithm for RNA-seq Alignment

The second algorithm, named RNAMap 2, is based on an extension strategy between non-overlapping seeds. Using fewer seeds reduces the amount of computation, which partly offsets RNAMap's disadvantage in speed: in a computational experiment on 300 bp reads, RNAMap 2 was almost 40% faster than RNAMap. In addition, RNAMap 2 adopts Needleman-Wunsch global dynamic programming to handle mismatches measured by edit distance, overcoming the limitation that RNAMap supports only mismatches measured by Hamming distance. Another computational experiment indicates that the recall of RNAMap 2 is nearly 2% higher than that of RNAMap.
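
The difference between Hamming distance and edit distance mentioned above can be illustrated with a minimal sketch. This is not the RNAMap or RNAMap 2 implementation: the function names, the example sequences, and the unit costs for mismatch, insertion, and deletion are all assumptions chosen for illustration. It shows why a single deleted base, which shifts every later position, looks like many mismatches under Hamming distance but only a small number of edits under Needleman-Wunsch style global dynamic programming.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count mismatching positions; defined only for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def nw_edit_distance(a: str, b: str) -> int:
    """Global (Needleman-Wunsch style) dynamic programming.

    Assumed scoring: cost 1 for a mismatch, an insertion, or a deletion,
    cost 0 for a match, so the result is the edit distance.
    """
    # prev[j] holds the cost of aligning the processed prefix of a with b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # aligning a[:i] with the empty prefix costs i deletions
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j - 1] + (x != y),  # match or substitution
                prev[j] + 1,             # deletion from a
                curr[j - 1] + 1,         # insertion into a
            ))
        prev = curr
    return prev[-1]


# Hypothetical reference window and two reads of the same length.
reference = "ACGTACGTAC"
read_sub = "ACGTTCGTAC"    # one substituted base
read_indel = "ACTACGTACG"  # one deleted base, so later bases are shifted

print(hamming_distance(reference, read_sub))    # 1
print(nw_edit_distance(reference, read_sub))    # 1
print(hamming_distance(reference, read_indel))  # 8
print(nw_edit_distance(reference, read_indel))  # 2
```

For the substituted read both measures agree. For the read with a deletion, the Hamming distance jumps to 8, while the edit distance is 2 (one deletion plus one trailing insertion, because the compared window has a fixed length).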
Keywords/Search Tags:bioinformatics, sequence alignment, RNA-seq, index structures, splice sites