Improving quality of high-throughput sequencing reads

Posted on:2016-01-02

Degree:Ph.D

Type:Dissertation

University:University of Illinois at Urbana-Champaign

Candidate:Heo, Yun

GTID:1470390017982270

Subject:Computer Engineering

Abstract/Summary:

Rapid advances in high-throughput sequencing (HTS) technologies have led to an exponential increase in the amount of sequencing data. HTS reads, however, contain far more errors than data collected through traditional sequencing methods. Errors in HTS reads degrade the quality of downstream analyses, and correcting them has been shown to improve the quality of these analyses.

Correcting errors in sequencing data is a time-consuming and memory-intensive process. Although many methods for correcting errors in HTS data have been developed, none has been able to correct errors with high accuracy while using a small amount of memory and a short running time. A further problem is that no standard or comprehensive method is yet available to evaluate the accuracy and effectiveness of error correction methods.

To address these limitations and to analyze error correction outputs, this dissertation presents three novel methods. The first, BLESS (Bloom-filter-based error correction solution for high-throughput sequencing reads), is a new error correction method that uses a Bloom filter as its main data structure. Compared to previous methods, it corrects errors with the highest accuracy while reducing memory usage by a factor of 40 on average. BLESS is parallelized using hybrid OpenMP and MPI programming, which makes it one of the fastest error correction tools. The second, SPECTACLE (Software Package for Error Correction Tool Assessment on Nucleic Acid Sequences), provides a standard way to evaluate error correction methods. SPECTACLE is a comprehensive method that can (1) perform a quantitative analysis of corrected DNA and RNA reads from any sequencing platform and (2) handle diploid genomes and differentiate heterozygous alleles from sequencing errors.

Lastly, this research analyzes the effect of sequencing errors on variant calling, one of the most important clinical applications of HTS data. For this purpose, an environment for tracing the effect of sequencing errors on germline and somatic variant calling was developed. Using this environment, the research studies how sequencing errors degrade the results of variant calling and how the results can be improved. Based on the new findings, ROOFTOP (RemOve nOrmal reads From TumOr samPles) was developed; it can improve the accuracy of somatic variant calling by removing reads that come from normal cells in tumor samples.

The series of studies on sequencing errors in this dissertation should help clarify how sequencing errors degrade downstream analysis outputs and how the quality of sequencing data can be improved by removing errors from the data.
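The Bloom-filter idea mentioned for BLESS can be illustrated with a minimal Python sketch. This is not the BLESS implementation: the class and function names, the k-mer length, the count threshold, and the filter size are all assumptions chosen for illustration, and the in-memory k-mer counting used here would not give the memory savings reported in the abstract. The sketch only shows the general technique of storing frequently observed ("solid") k-mers in a Bloom filter and flagging the k-mers of a read that are absent from it.

```python
import hashlib
from collections import Counter


class BloomFilter:
    """Fixed-size bit array queried through several hash positions (toy version)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all positions from two 64-bit values.
        digest = hashlib.sha256(item.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def kmers(read, k):
    """All overlapping substrings of length k, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_solid_filter(reads, k, min_count, num_bits=1 << 16, num_hashes=4):
    """Insert k-mers seen at least min_count times (hypothetical threshold)."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    bloom = BloomFilter(num_bits, num_hashes)
    for km, count in counts.items():
        if count >= min_count:
            bloom.add(km)
    return bloom


def weak_positions(read, k, bloom):
    """Start indices of k-mers not found in the filter (candidate error sites)."""
    return [i for i, km in enumerate(kmers(read, k)) if km not in bloom]


if __name__ == "__main__":
    reads = ["ACGTACGTAC"] * 3 + ["ACGTAGGTAC"]  # last read has one substitution
    bloom = build_solid_filter(reads, k=4, min_count=2)
    print(weak_positions("ACGTAGGTAC", 4, bloom))
```

A run of consecutive weak k-mers overlapping the same base suggests where a substitution may have occurred. Because a Bloom filter can report false positives, an erroneous k-mer may occasionally be accepted as solid; it never rejects a k-mer that was actually inserted.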

Keywords/Search Tags:

Sequencing, Data, Errors, Quality, HTS, Error correction, Reads, Variant calling

Related items

1	Algorithmic Study On Long Read Assembly Error Correction Based On Linked Reads Sequencing Data
2	Development of SRADE tool and analysis of quality scores of the reads of Next-Generation Sequencing data
3	Genotype Calling And SNP Detection For Single-cell DNA Sequencing Data
4	Statistical methods for genome variant calling and population genetic inference from next-generation sequencing data
5	Gene Identification Via Phenotype Sequencing
6	Detection Of Genome Structural Variations Based On Third Generation Sequencing Data
7	Detection Of Genome Variants Based On High Throughput Sequencing Data
8	Optimization Research And Implementation Of DNA Sequencing Data Analysis Tool MuTect2
9	Test And Comparison Of Software Suitable For RNA-seq Reads Mapping Via Simulated And Real Reads
10	Research On Calling Methods Of Structural Variation Based On Third Generation Sequencing Data