Markov chain Monte Carlo applications in bioinformatics and astrophysics

Posted on: 2006-11-19
Degree: Ph.D.
Type: Thesis
University: Harvard University
Candidate: Kang, Hosung
Full Text: PDF
GTID: 2450390008451370
Subject: Statistics
Abstract/Summary:
This thesis comprises three applications of Markov chain Monte Carlo methods in astrophysics and bioinformatics. The recent development of high-resolution satellite telescopes and advances in genotyping technologies for Single-Nucleotide Polymorphisms (SNPs) have produced a wealth of data in astrophysics and genetics, and with it tremendous opportunities for statisticians to develop complex models and computational techniques. In recent years, thanks to the powerful Markov chain Monte Carlo (MCMC) sampling method and improvements in computing speed, complicated models can be fitted to large data sets efficiently.

The main statistical framework for the applications in this thesis is data augmentation, which constructs iterative optimization or sampling algorithms by introducing unobserved data or latent variables. Among deterministic algorithms, the EM (Expectation-Maximization) algorithm is generally used to maximize a likelihood function or a posterior density. Among stochastic algorithms, data augmentation and Gibbs sampling are popular for posterior sampling.

The first paper describes a new and powerful method for estimating the temperature distribution of matter in the outermost layer of a star's atmosphere, using data augmentation and Bayesian hierarchical modeling techniques. This method enables us either to fit a selected subset of emission lines with measured fluxes or to perform a global fit over the full wavelength range of the instrument, to obtain error bars that determine the significance of features seen in the estimate, and to directly incorporate prior information such as known atomic data errors and systematic effects due to calibration uncertainties.

The second paper proposes a novel genotype clustering algorithm based on a bivariate t-mixture model, which assigns to each data point a set of probabilities of belonging to each candidate genotype cluster. Furthermore, the model allows us to use the probabilistic multi-locus genotype matrices as inputs for haplotype phasing. By combining the genotyping and phasing steps, we can perform haplotype inference directly on raw readouts from a genotyping platform such as the TaqMan assay, with less error than competing methods.

The third paper develops a Bayesian linkage-disequilibrium mapping model for complex diseases. Haplotype analysis of disease chromosomes allows us to localize disease mutations as well as to identify historical recombination events descending from founder haplotypes. The primary improvement of this model over previous ones is its ability to discern the locations of two disease mutations as well as their interaction effects.
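
The data augmentation idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis's model: it is a toy two-component Gaussian mixture with assumed equal mixing weights, a known common standard deviation `sigma`, and a normal prior centered at zero on each mean. The function name `gibbs_two_component` and all of its parameters are hypothetical. The sampler alternates between drawing the latent cluster labels given the means and drawing the means given the labels.

```python
import numpy as np

def gibbs_two_component(y, n_iter=500, sigma=1.0, prior_sd=10.0, seed=0):
    """Toy data-augmentation Gibbs sampler (illustrative assumptions only).

    Assumes equal mixing weights, known sigma, and N(0, prior_sd^2) priors
    on the two component means. Returns an (n_iter, 2) array of mean draws.
    """
    rng = np.random.default_rng(seed)
    mu = np.array([y.min(), y.max()], dtype=float)  # crude starting values
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        # Augmentation step: sample latent labels given the current means.
        log_w = -0.5 * ((y[:, None] - mu[None, :]) / sigma) ** 2
        w = np.exp(log_w - log_w.max(axis=1, keepdims=True))
        p1 = w[:, 1] / w.sum(axis=1)      # P(label = 1 | y, mu)
        z = rng.random(y.size) < p1       # True means component 1
        # Posterior step: sample each mean given the labels (conjugate normal).
        for k, mask in enumerate((~z, z)):
            prec = mask.sum() / sigma**2 + 1.0 / prior_sd**2
            mean = (y[mask].sum() / sigma**2) / prec
            mu[k] = rng.normal(mean, np.sqrt(1.0 / prec))
        draws[t] = mu
    return draws
```

Averaging the label indicators over iterations would give each point a probability of belonging to each cluster, which is the same kind of soft assignment the abstract attributes to the genotype clustering model, although that model uses bivariate t components rather than the univariate Gaussians assumed here.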
Keywords/Search Tags: Markov chain Monte Carlo, Applications, Model, Data