Statistical methods to address the challenges posed by rare variants and missing genotypes in case-control resequencing studies

Posted on: 2014-02-24

Degree: Ph.D.

Type: Dissertation

University: University of Miami

Candidate: Kinnamon, Daniel Davis

GTID: 1450390008958171

Subject: Biology

Abstract/Summary:

Case-control resequencing studies are growing in popularity as investigators apply novel massively parallel sequencing technologies to existing case-control data sets. However, the sequence data generated by these studies present several daunting analytic challenges. The present study focuses on addressing the challenges posed by rare variants and missing genotypes when performing a test for association between a disease and a locus using data from a case-control resequencing study.

Association tests that pool minor alleles into a measure of burden at a locus have been proposed to address allelic heterogeneity in the presence of rare variants. However, such pooling tests are not robust to the inclusion of neutral and protective variants, which can mask the association signal from risk variants, and may not be robust to randomly missing genotypes. In contrast, methods for locus-wide inference using nonnegative single-variant test statistics are robust to both the inclusion of neutral and protective variants and randomly missing genotypes. Therefore, three existing methods for locus-wide inference using nonnegative single-variant test statistics were compared to two widely cited pooling tests under realistic conditions. Analytic results for a simple model with one rare risk and one rare neutral variant demonstrated that pooling tests are less powerful than even Bonferroni-corrected single-variant tests in most situations. These results were extended by Monte Carlo simulations using variants with realistic minor allele frequency and linkage disequilibrium spectra, disease models with multiple rare risk variants and extensive neutral variation, and varying rates of randomly missing genotypes. In all scenarios considered, at least one existing method using nonnegative single-variant test statistics had power comparable to or greater than the two pooling tests considered. These results suggest that efficient locus-wide inference using single-variant test statistics should be reconsidered as a useful framework for addressing the challenge posed by rare variants in case-control resequencing studies.

Methods that perform efficient locus-wide inference using nonnegative single-variant test statistics also partially address the challenge posed by missing genotypes because they can use all available genotype data. When these methods are based on permutation tests, inferences will be valid if genotypes are randomly missing, that is, if the probability of a missing genotype at a variant does not depend on other observed or unobserved variables in the study. However, it was unclear whether methods based on permutation tests would yield valid inferences for nonrandomly missing genotypes. Therefore, a rigorous theoretical framework for constructing valid permutation tests was developed for genetic case-control studies with unrelated subjects and missing genotypes arising from a variety of missing data processes. The development began with the specification of a nonparametric probability model for the observed data in such a study. Group-theoretic arguments were then used to establish two conditions that together guarantee an exact level-α Monte Carlo permutation test for data generated under this nonparametric probability model. One of these conditions is not satisfied for the most frequently used Monte Carlo permutation test, and this test is guaranteed to be level α only for missing data processes with certain characteristics. An alternative Monte Carlo permutation test, which is exact level α as long as all covariates influencing the missing data process are identified and recorded, was therefore proposed. The theoretical development was supplemented with Monte Carlo simulations for a variety of test statistics and missing data processes. These results demonstrate that Monte Carlo permutation tests must be constructed with careful consideration of the missing data process to adequately address the challenge posed by missing genotypes and avoid inferential errors.
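As a rough illustration of the general idea of locus-wide inference from nonnegative single-variant statistics, the sketch below runs a Monte Carlo permutation test on the maximum statistic across variants, using whatever genotypes are available at each variant. Everything in it is an assumption made for illustration and is not taken from the dissertation: the function names, the choice of statistic (a squared standardized difference in mean minor-allele count), and the NaN coding for missing genotypes. It uses the simple unrestricted shuffle of case-control labels, which the abstract indicates is guaranteed to be level α only for certain missing data processes; it is not the alternative test the dissertation proposes.

```python
import numpy as np


def single_variant_stats(genotypes, is_case):
    """Nonnegative statistic per variant, computed from non-missing genotypes only.

    genotypes: (subjects x variants) array of minor-allele counts, NaN = missing.
    is_case:   boolean array, True for cases and False for controls.
    """
    is_case = np.asarray(is_case, dtype=bool)
    stats = np.zeros(genotypes.shape[1])
    for j in range(genotypes.shape[1]):
        g = genotypes[:, j]
        observed = ~np.isnan(g)
        cases = g[observed & is_case]
        controls = g[observed & ~is_case]
        if len(cases) == 0 or len(controls) == 0:
            continue  # no usable comparison at this variant
        pooled_var = g[observed].var()
        if pooled_var == 0:
            continue  # monomorphic among observed genotypes
        se2 = pooled_var * (1.0 / len(cases) + 1.0 / len(controls))
        stats[j] = (cases.mean() - controls.mean()) ** 2 / se2
    return stats


def max_stat_permutation_test(genotypes, is_case, n_perm=999, seed=0):
    """Monte Carlo permutation p-value for the maximum single-variant statistic."""
    rng = np.random.default_rng(seed)
    is_case = np.asarray(is_case, dtype=bool)
    observed_max = single_variant_stats(genotypes, is_case).max()
    exceed = 0
    for _ in range(n_perm):
        # Unrestricted shuffle of case-control labels across all subjects.
        permuted = rng.permutation(is_case)
        if single_variant_stats(genotypes, permuted).max() >= observed_max:
            exceed += 1
    # Counting the observed data as one of the permutations keeps p > 0.
    return observed_max, (exceed + 1) / (n_perm + 1)
```

Because the maximum is taken over per-variant statistics, a neutral variant adds noise to the permutation distribution but cannot cancel the signal from a risk variant, which is the robustness property the abstract contrasts with pooling tests. If missingness depends on recorded covariates, one plausible adaptation (again an assumption, not the author's stated construction) would be to shuffle labels only within strata defined by those covariates.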

Keywords/Search Tags:

Missing genotypes, Case-control resequencing, Variants, Nonnegative single-variant test statistics
