Font Size: a A A

Application of non-parametric algorithms to computational biology

Posted on:2010-04-07Degree:Ph.DType:Dissertation
University:Stanford UniversityCandidate:Fratkin, EugeneFull Text:PDF
GTID:1448390002485454Subject:Biology
Abstract/Summary:
Computational biology is an emerging, cross-disciplinary field that combines elements of biology, genetics, mathematics and computer science. As a branch of computer science it is distinct from many others in the way that biological problems have a virtually unlimited complexity. Computational solutions for such problems are fundamentally a trade-off between computational tractability and biological fidelity. Therefore, algorithms developed for computational biology should minimize their dependence on statistical parameterization of the input data.;Over the past thirty years computational biologists took advantage of many data representations, such as Position Specific Scoring Matrix, which represent data of significantly similar sequences in a concise, compact manner. In such representations values are recorded as statistical distributions at each position in the set of sequences. This simplification greatly reduced the input size, in many instances making a problem computationally manageable. However, important information about position interdependence is obfuscated in the process.;In this work I exemplify the benefits of retaining such information in two computational biology problems: the identification of transcription factor (TF) binding sites and the inference of an individual ancestry using dense single nucleotide polymorphism (SNP) array.;The problem of TF binding site discovery can be viewed computationally as the problem of locating a set of substrings within a much larger input, where these substrings share a statistically significant degree of similarity. The similarity does not imply a perfect identity and in fact in many instances as much as 60 percent of the substring content is not preserved. To solve this problem we have developed a natural, graph theoretical formulation. In this formulation each substring of the input is represented as a vertex and edge weights represent a degree of pairwise similarity of two vertices. Using this formulation, we can discover likely candidate motifs as the dense subgraphs within such a graph. We further demonstrated that this representation outperforms other existing methods and is able to find motifs that have a great degree of interdependence between its nucleotides.;In the problem of inference of the ancestry of an individual our input consists of several sets of SNP values---values of nucleotides at specific locations in one's DNA. Some of those sets come from reference individuals, for whom we know the exact ethnic origin. And some come from the individual whose origin is to be inferred. Probability of observing a specific SNP value at a given DNA location may vary slightly from population to population. Our goal is to use this statistical signal to determine the most likely ethnic origin of a specific segment of one's DNA. In our work we extended the existing approaches of using graphical statistical models to the representation in which every sample individual is represented as a distinct set of states. Advantage of such a representation is that it implicitly takes into account the linkage disequilibrium to infer an ancestral haploblocks. From this we obtained significantly higher quality of inference. Using this approach on specifically synthesized populations we demonstrated that given a good reference database of sample individuals one can reconstruct an individual ancestry as many as 20 generations back.;Each approach described in this dissertation exemplifies benefits of working with non-parametric algorithms. In both cases the additional information produced by a non-parametric approach resulted in an improvement of the final prediction.
Keywords/Search Tags:Computational, Biology, Non-parametric, Algorithms
Related items