
Statistical learning applied to transcriptional regulation in small N, large D domains

Posted on: 2014-07-20 | Degree: Ph.D. | Type: Thesis
University: The Johns Hopkins University | Candidate: Simcha, David M. | Full Text: PDF
GTID: 2458390008454177 | Subject: Engineering
Abstract/Summary:
The last 15 years have witnessed an explosion of high-throughput biological data, in which huge numbers of variables (dimensions) are measured for each patient or organism (sample). The most important examples include DNA and RNA sequencing and microarrays. Early hopes that access to such huge volumes of information would revolutionize the field have been tempered by the difficulty of analyzing such data. There is usually no feasible way to interpret these data manually, so statistical learning is typically used. An important limitation, though, is the "small N, large D" problem: examining a huge number of dimensions with a limited sample size increases the occurrence of spurious results due to chance and provides limited ability to infer complex interactions. This thesis focuses on improving statistical learning methodology for high-throughput biological data in three specific areas.

The first area is inference of phenomenological gene regulatory networks, i.e., determining which genes will be affected by perturbing the expression of a given gene. This is done by integrating high-throughput cytosine methylation data, which has only recently become available and has not previously been used for this purpose, with mRNA expression data. Bayesian networks are then used to infer directed regulatory networks. The method developed is termed IDEM, for Identification of Direction from Expression and Methylation.

A related area is mechanistic gene regulatory networks, where the focus is on gene regulation arising from direct interactions between transcription factor proteins and the DNA sequence near their target genes. The subproblem examined in this thesis is de novo motif discovery. It is demonstrated that commonly used generative models of "random" DNA sequence are "too null" and fail to capture important properties of "random" DNA, which motivates a discriminative approach. Such an approach is difficult, though, because the sample size is effectively limited to the number of coregulated genes or the number of genes to which a given transcription factor binds, whereas the number of possible binding motifs is enormous and the dimensionality can reach several thousand nucleotides. It is shown that, when properly validated, existing discriminative approaches perform very poorly. Finally, an adjusted logistic regression (ALR) method is developed to mitigate the weaknesses identified in prior methods.

Lastly, a classifier for tumor site of origin is created by aggregating publicly available data from over 100 studies, thereby increasing the sample size to the point where robust prediction is feasible. It is demonstrated that including a large number of studies in the training data mitigates batch and study effects. The accuracy of several classification techniques, including a novel one based on decision trees of top scoring pairs (TSPs), is compared. Finally, it is shown that preserving cross-study diversity of samples is even more important than preserving sample size, and the degree to which ordinary cross-validation is over-optimistic relative to cross-study validation is quantified.

Overall, we demonstrate the importance of tailoring learning methods to the underlying biology, the available sample size, and the appropriate null hypothesis.
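As a concrete illustration of the discriminative motif-discovery setting described above (not of the ALR method itself, whose adjustments are specific to the thesis), the sketch below fits an L1-regularized logistic regression on k-mer counts to separate putatively co-regulated promoter sequences from background sequences. The k-mer length, regularization strength, and sequence lists are illustrative choices; with a few dozen positive sequences and 4^k features, this is exactly the "small N, large D" regime in which honest validation matters.

```python
# Sketch of a generic discriminative baseline for motif discovery:
# sparse logistic regression on k-mer counts. This is NOT the thesis's
# ALR method; k-mer length and regularization are illustrative.
from itertools import product

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

K = 6  # k-mer length; 4**6 = 4096 features dwarfs typical sample sizes
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}

def kmer_counts(seq):
    """Count overlapping k-mers in one upper-case DNA sequence."""
    x = np.zeros(len(KMERS))
    for i in range(len(seq) - K + 1):
        idx = KMER_INDEX.get(seq[i:i + K])
        if idx is not None:  # skip windows containing N or other symbols
            x[idx] += 1
    return x

def fit_discriminative_motif_model(pos_seqs, neg_seqs):
    """Fit sparse logistic regression; large positive weights flag
    candidate motif k-mers enriched in the positive (co-regulated) set."""
    X = np.array([kmer_counts(s) for s in pos_seqs + neg_seqs])
    y = np.array([1] * len(pos_seqs) + [0] * len(neg_seqs))
    model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    # Cross-validated accuracy is an honest check on whether the model
    # has learned anything beyond chance in this small-N, large-D setting.
    acc = cross_val_score(model, X, y, cv=5).mean()
    model.fit(X, y)
    top = np.argsort(model.coef_[0])[::-1][:10]
    return model, acc, [KMERS[i] for i in top]
```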
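The top-scoring-pair idea underlying the tumor site-of-origin classifier can be sketched as follows. A single TSP rule compares the expression of two genes within the same sample, so it depends only on their relative order and is therefore largely insensitive to the per-study normalization differences that arise when many studies are pooled. This is a minimal sketch of one TSP, not the decision-tree-of-TSPs classifier developed in the thesis, and the all-pairs comparison assumes a pre-filtered gene set since it is quadratic in the number of genes.

```python
# Minimal sketch of a single top-scoring-pair (TSP) rule, the rank-based
# building block behind decision trees of TSPs. Assumes X is a
# samples-by-genes expression matrix over a modest, pre-filtered gene set.
import numpy as np

def train_tsp(X, y):
    """Return the gene pair (i, j) whose within-sample ordering best
    separates binary classes y (0/1)."""
    X0, X1 = X[y == 0], X[y == 1]
    # p[i, j] = fraction of samples in the class with gene_i < gene_j.
    p0 = (X0[:, :, None] < X0[:, None, :]).mean(axis=0)
    p1 = (X1[:, :, None] < X1[:, None, :]).mean(axis=0)
    score = np.abs(p0 - p1)  # TSP score for every gene pair
    i, j = np.unravel_index(np.argmax(score), score.shape)
    # Orientation of the rule: predict class 1 when the observed ordering
    # matches the ordering that is more common in class 1.
    predict_one_if_less = p1[i, j] > p0[i, j]
    return i, j, predict_one_if_less

def predict_tsp(X, i, j, predict_one_if_less):
    """Classify each sample by comparing genes i and j within that sample."""
    less = X[:, i] < X[:, j]
    return (less == predict_one_if_less).astype(int)
```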
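Finally, the gap between ordinary cross-validation and cross-study validation can be measured with a leave-one-study-out scheme, sketched below with scikit-learn. The random forest stands in for any classifier (the thesis compares several, including TSP-based ones), and X, y, and study are placeholders for the pooled expression matrix, tissue-of-origin labels, and per-sample study identifiers.

```python
# Sketch contrasting ordinary k-fold cross-validation with cross-study
# (leave-one-study-out) validation on data pooled from many studies.
# X, y, study are placeholders; the random forest is a stand-in classifier.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

def compare_validation(X, y, study):
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    # Ordinary CV: each fold mixes samples from all studies, so
    # study-specific batch effects can leak between train and test.
    cv_acc = cross_val_score(clf, X, y, cv=5).mean()
    # Cross-study validation: hold out one entire study at a time,
    # which is the setting a deployed classifier actually faces.
    cs_acc = cross_val_score(clf, X, y, groups=study,
                             cv=LeaveOneGroupOut()).mean()
    # cs_acc is typically the lower, more honest estimate.
    return cv_acc, cs_acc
```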
Keywords/Search Tags:Sample size, Statistical learning, Data, Large, DNA