
Research On Key Alternative Splicing Event Identification From RNA-Seq Data

Posted on: 2017-03-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Bai
GTID: 1220330503469672
Subject: Computer application technology
Abstract/Summary:
Alternative splicing (AS) is a cellular mechanism that produces different transcript isoforms from a single gene-coding region during gene expression, giving rise to the functional complexity and diversity of proteins in higher eukaryotic organisms. Identifying alternative splicing events is of great importance to studies of gene function, the structural and functional diversity of proteins, cellular differentiation, and species evolution. With the development of next-generation sequencing technology, genome-wide AS event identification from RNA-Seq data has become a popular research topic in biology. However, accurate identification of exon skipping (ES) and intron retention (IR) events from RNA-Seq data remains an unresolved challenge in next-generation high-throughput sequencing studies. A variety of computational methods have analyzed RNA-Seq data for AS event identification. However, existing methods for ES and IR event identification usually employ only part of the features related to ES and IR, omit some significant features, and use reads with low mapping quality. In addition, they do not apply a standard for feature normalization, and they do not determine which features are important for ES and IR event identification. We are therefore motivated to provide a thorough analysis of all those features, determine their relative importance, and incorporate more features to design more precise prediction methods for ES and IR event identification. The major research content of this thesis comprises the following four parts:

(1) Feature analysis of the exon skipping event is presented. Compared with existing ES event identification methods, our method employs more features to describe each exon and conducts a thorough analysis of all commonly employed features, showing their relative importance in ES event identification.
Moreover, we built four different feature sets to study the effectiveness of different feature normalizations. In addition, we experimented on published RNA-Seq data from human skeletal muscle, brain, heart, and liver tissues, constructed training examples by incorporating the predictions from three state-of-the-art approaches, analyzed all commonly employed features, and studied the effectiveness of different feature normalizations. The experimental results show that the read count supporting the exclusive isoform and the psi score are two important features for ES event identification. In addition, feature normalization has little effect on the accuracy of ES event identification.

(2) A novel method, EScall, for exon skipping event identification based on multiple-feature analysis is proposed. Compared with existing work, EScall applies criteria to filter out low-quality reads and reads aligned to multiple locations from TopHat alignments. Moreover, EScall employs more features (read counts within an exon, read counts over exon-exon splice junctions, gene expression, etc.) to define a new formula for calculating the alternative exon skipping score. In addition, we experimented on published RNA-Seq data from human skeletal muscle and brain tissues and compared the predictions with those of three state-of-the-art approaches. The experimental results show that EScall can effectively avoid bias, reduce false positives and false negatives, and detect ES events with higher precision.

(3) A novel method, IRcall, which uses a combination score for IR event identification from RNA-Seq data, is proposed. Compared with existing work, IRcall employs seven features (read counts within an intron, read counts supporting the splice junction, read counts within flanking exons, read counts overlapping the 5’ splice site, read counts overlapping the 3’ splice site, read coverage within an intron, and the gene expression RPKM value) to define a new formula, IRScore, for calculating IR scores with a ranking strategy. In addition, we experimented on published RNA-Seq data from the skip mutant and wild type of Arabidopsis thaliana and compared the predictions with those of three state-of-the-art approaches. The experimental results show that IRcall can effectively avoid bias, reduce false positives and false negatives, and detect IR events with higher precision.

(4) A novel method, IRclassifier, a random forest classifier for IR event identification from RNA-Seq data, is proposed. Unlike existing methods, IRclassifier applies machine learning techniques based on random forests to IR event identification. IRclassifier selects reference examples by incorporating the predictions from three state-of-the-art approaches, and constructs 21 features between treatment and control conditions to represent introns in a higher-dimensional space. Moreover, IRclassifier conducts a thorough analysis of all commonly employed features, showing their relative importance in IR event detection. In addition, we experimented on published RNA-Seq data from the skip mutant and wild type of Arabidopsis thaliana, and constructed training examples by incorporating the predictions on Chromosome 1, Chromosome 2, and Chromosome 4 from three state-of-the-art approaches. The experimental results show that IRclassifier can effectively identify IR events with a precision over 99.2%. We then used IRclassifier to detect IR events on Chromosome 3 and Chromosome 5 and compared the predictions with those of three state-of-the-art approaches. The experimental results show that IRclassifier can effectively detect IR events with higher precision.
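
The psi score named in part (1) as an important ES feature is commonly estimated from splice-junction read counts. The sketch below illustrates one widely used junction-based estimate; it is not the formula used in the thesis, which the abstract does not state. The function name `psi_from_junctions` and its parameters are hypothetical, and averaging the two inclusion junctions is an assumption made so that the inclusion isoform (two junctions) and the skipping isoform (one junction) are compared on equal footing.

```python
def psi_from_junctions(upstream_inc, downstream_inc, skip):
    """Estimate percent spliced in (PSI) for one candidate cassette exon.

    upstream_inc   -- reads spanning the upstream exon -> candidate exon junction
    downstream_inc -- reads spanning the candidate exon -> downstream exon junction
    skip           -- reads spanning the upstream -> downstream junction,
                      i.e. reads supporting the exon-skipping isoform

    Returns a value in [0, 1], or None when no junction reads are available.
    This is an illustrative estimate, not the thesis's scoring formula.
    """
    if min(upstream_inc, downstream_inc, skip) < 0:
        raise ValueError("read counts must be non-negative")

    # The inclusion isoform contributes two junctions, so average them
    # before comparing with the single skipping junction (assumption).
    inclusion = (upstream_inc + downstream_inc) / 2.0
    total = inclusion + skip
    if total == 0:
        return None  # no evidence either way
    return inclusion / total


if __name__ == "__main__":
    # 30 and 50 inclusion-junction reads, 10 skipping-junction reads
    print(psi_from_junctions(30, 50, 10))  # prints 0.8
```

A PSI near 1 suggests the exon is almost always included, while a PSI near 0 suggests it is almost always skipped; returning None for exons without junction reads keeps unsupported candidates from being scored as either.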
Keywords/Search Tags: Alternative splicing, RNA-Seq, Exon skipping event, Intron retention event, Random forest