Harmonic analysis of ChIP-seq

Posted on: 2016-08-18
Degree: Ph.D.
Type: Dissertation
University: Yale University
Candidate: Stanton, Kelly Patrick
Full Text: PDF
GTID: 1478390017983676
Subject: Bioinformatics
Abstract/Summary:
Next-generation sequencing (NGS), genome sequencing, and derived techniques have become mainstays of the genomics laboratory. One such technique, chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq), is a method for discovering chromatin modifications or transcription factor binding sites on a genome-wide scale. Analysis of ChIP-seq data is the focus of this dissertation: specifically, noise reduction through modeling sources of technical and biological variation, and the application of harmonic analysis to extract biologically meaningful signal.

In the first chapter, we present a signal processing approach, called Arpeggio, that characterizes protein-chromatin interaction patterns at length scales of several kilobases. This allows efficient comparisons between 806 publicly available ChIP-seq experiments. The approach preserves biological properties of the signal that are not easily detected by standard peak callers, such as histone modification periodicity, and facilitates cross-species comparisons.

In the second chapter, we address the problem of peak detection in ChIP-seq. Numerous algorithms have been developed to detect transcription factor binding sites in transcription-factor-targeted ChIP. All of these algorithms have several shortcomings, however, including susceptibility to false positives, poor localization of the binding site, and the requirement for a total DNA input control, which increases the cost of performing these experiments. Upon investigating sources of error, we discovered that many of these problems stem from a mapping issue that produces a shift of one read length between reads from the two strands of DNA. Additionally, these read length artifacts influence the normalized strand cross-correlation coefficient (NSC), a quality control metric based on opposing-strand cross-correlation. As a consequence, many experiments may be reported as poor quality and therefore discarded. Mismapping of reads in repetitive sequences or blacklisted regions can create these read length artifacts. We show how to remove these artifacts and recover the corrected fragment length distribution of any given experiment. We present Ritornello, a novel single-end peak calling method that does not require input controls. Ritornello is based on analysis of the artifact-free fragment length distribution, the shape of peaks at true binding sites, and the distribution of background noise.

By incorporating a detailed characterization of read length artifacts, as well as the structure of the signal and noise in ChIP-seq, Ritornello uses peak shape to localize transcription factor binding and addresses a problem of other peak callers, allowing analysis of data that would otherwise have been discarded in the quality control step.

Finally, we present possible improvements to the Ritornello method. These include improved memory usage for storage of the genome, approaches for fragment length distribution discovery, and methods for modeling read length artifacts. Additionally, we discuss extensions to Ritornello, including the abundance quantification necessary for differential binding analysis and the requirements for adapting it to analyses of histone marks and polymerase ChIP-seq.
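The opposing-strand cross-correlation mentioned in the abstract can be illustrated with a toy simulation. The sketch below is not Ritornello and is not taken from the dissertation; it is a minimal illustration under assumed values. The genome size, the fragment length of 150, the read length of 36, the evenly spaced binding sites, and the function names `simulate_read_starts` and `strand_cross_correlation` are all hypothetical. Reads are represented by their 5' positions, so a fragment contributes a forward read at its start and a reverse read at its end, while a mismapped pile-up contributes forward and reverse reads that cover the same interval and therefore sit one read length apart.

```python
import random

GENOME_LEN = 20_000
FRAG_LEN = 150    # assumed mean fragment length (toy value)
READ_LEN = 36     # assumed read length (toy value)
MAX_SHIFT = 300   # largest strand shift examined


def simulate_read_starts(with_artifacts, seed=0):
    """Return per-base counts of read 5' ends on the forward and reverse strands."""
    rng = random.Random(seed)
    fwd = [0] * GENOME_LEN
    rev = [0] * GENOME_LEN
    # Toy binding sites, evenly spaced so neighbouring sites do not interact.
    for site in range(1000, 14_000, 260):
        for _ in range(20):
            frag_len = FRAG_LEN + rng.randint(-5, 5)
            start = site - FRAG_LEN // 2 + rng.randint(-20, 20)
            fwd[start] += 1                 # forward read: 5' end at fragment start
            rev[start + frag_len - 1] += 1  # reverse read: 5' end at fragment end
    if with_artifacts:
        # Toy mismapping pile-ups: reads on both strands cover the same
        # interval, so their 5' ends are separated by READ_LEN - 1 bases.
        for pos in range(15_000, 20_000, 1000):
            fwd[pos] += 30
            rev[pos + READ_LEN - 1] += 30
    return fwd, rev


def strand_cross_correlation(fwd, rev, max_shift):
    """Correlation of forward counts with reverse counts moved left by each shift.

    Means and standard deviations are computed once over the full vectors and
    positions shifted past the end are treated as zero, so this approximates
    the Pearson correlation at each shift.
    """
    n = len(fwd)
    mean_f = sum(fwd) / n
    mean_r = sum(rev) / n
    sd_f = (sum(c * c for c in fwd) / n - mean_f ** 2) ** 0.5
    sd_r = (sum(c * c for c in rev) / n - mean_r ** 2) ** 0.5
    occupied = [(i, c) for i, c in enumerate(fwd) if c]  # skip empty positions
    profile = []
    for shift in range(max_shift + 1):
        dot = sum(c * rev[i + shift] for i, c in occupied if i + shift < n)
        profile.append((dot / n - mean_f * mean_r) / (sd_f * sd_r))
    return profile


for label, flag in (("clean", False), ("with artifacts", True)):
    fwd, rev = simulate_read_starts(with_artifacts=flag)
    profile = strand_cross_correlation(fwd, rev, MAX_SHIFT)
    best_shift = profile.index(max(profile))
    print(f"{label}: strongest cross-correlation at shift {best_shift}")
```

In this toy setting the clean data should peak near the fragment length, while the data with pile-ups should peak at the read-length shift instead. That is one way to see how such artifacts could distort a quality metric derived from the cross-correlation profile.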
Keywords/Search Tags: ChIP-seq, Read length artifacts, Binding, Fragment length distribution