Mining Large-scale Sequencing Reads to Learn Mutational Processes

Posted on: 2018-01-02
Degree: Ph.D.
Type: Dissertation
University: Yale University
Candidate: Li, Shantao
Full Text: PDF
GTID: 1444390005953741
Subject: Bioinformatics
Abstract/Summary:
Mutations are alterations in DNA that can have critical, permanent functional and evolutionary consequences. Understanding the underlying mutational processes and mechanisms is a cornerstone of genomics research. With modern high-throughput sequencing, researchers have access to an unprecedented abundance of DNA mutation data. This dissertation presents computational methods and analyses of large-scale sequencing reads that reveal details of mutational processes in human DNA. Specifically, I focus on 1) single nucleotide variants in human cancer, 2) deletion breakpoints and 3) retroduplications in the human germline. In cancer, I develop a LASSO-based method to identify active mutational processes in tumor samples. It gives sparse, biologically interpretable solutions and can leverage prior knowledge learned from pan-cancer analysis. Furthermore, I propose a generative model that integrates mutational heterogeneity in both nucleotide contexts and genomic locations. By exploiting the fingerprints of mutational processes in both aspects, this framework can potentially identify mutational processes more accurately and help reveal the underlying biology. Using papillary renal cell carcinoma (pRCC) and data from the Pan-Cancer Analysis of Whole Genomes (PCAWG) as case studies, I showcase the power of these methods in cancer genomics. In the human germline, I jointly analyze the 1000 Genomes Project data with other genomic annotations. I demonstrate how strong selection and mutational mechanisms together shape the distribution of deletions in human genomes. In addition, I develop a method specifically targeting retroduplications in human genomes. Using this method, I obtain the largest human retroduplication variation set from 26 populations. These retroduplications reveal population structure and give hints about recent human evolution and divergence. Further insertion point analysis shows how selection and mutational processes drive the nonrandom distribution of retroduplications in the genome. Finally, to address the explosion of biological data, I optimize the algorithm for a Monte Carlo simulation method for protein surface sampling. The new algorithm lowers the computational complexity to O(n^2) and thus makes it practical to apply the sampling method to large real-world proteins and complexes.
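
The LASSO-based identification of active mutational processes mentioned in the abstract can be illustrated with a minimal sketch. The abstract gives no implementation details, so everything below is an assumption made for illustration, not the dissertation's method: a tumor's mutation counts are modeled as a non-negative mix of known signatures, an L1 penalty (`lam`) pushes inactive signatures to zero, and the fit uses plain coordinate descent. The function name `nonneg_lasso` and the toy four-channel signatures are hypothetical.

```python
def nonneg_lasso(signatures, observed, lam, n_iter=200):
    """Minimize 0.5 * ||observed - sum_j w_j * signatures[j]||^2 + lam * sum_j w_j
    subject to w_j >= 0, using cyclic coordinate descent.

    signatures: list of signatures, each a list of per-channel values (assumed input layout)
    observed:   list of per-channel mutation counts for one sample
    lam:        L1 penalty strength; larger values give sparser solutions
    """
    weights = [0.0] * len(signatures)
    residual = list(observed)  # observed minus current fit (fit starts at zero)
    for _ in range(n_iter):
        for j, sig in enumerate(signatures):
            norm_sq = sum(s * s for s in sig)
            if norm_sq == 0.0:
                continue
            # Correlation of signature j with the residual, with its own
            # current contribution added back in.
            rho = sum(s * (r + s * weights[j]) for s, r in zip(sig, residual))
            # Soft-threshold, then clip at zero to keep the weight non-negative.
            new_w = max(0.0, (rho - lam) / norm_sq)
            delta = new_w - weights[j]
            if delta != 0.0:
                residual = [r - s * delta for s, r in zip(sig, residual)]
                weights[j] = new_w
    return weights


# Toy example (hypothetical data): three signatures over four mutation channels.
sig_a = [0.7, 0.1, 0.1, 0.1]
sig_b = [0.1, 0.7, 0.1, 0.1]
sig_c = [0.1, 0.1, 0.7, 0.1]
# Counts generated as 100 * sig_a + 50 * sig_c, so sig_b is truly inactive.
observed = [75.0, 15.0, 45.0, 15.0]

weights = nonneg_lasso([sig_a, sig_b, sig_c], observed, lam=0.5)
print(weights)
```

In this toy case the penalty slightly shrinks the weights of the two active signatures and sets the weight of the inactive one to zero, which is the sparsity property the abstract refers to. Real analyses would use a full catalog of signatures over many more mutation channels and would need a principled choice of `lam`.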
Keywords/Search Tags: Mutational, DNA, Method, Sequencing, Human