
Managing probabilistic data: Toward data-driven biology

Posted on: 2008-07-07
Degree: Ph.D.
Type: Dissertation
University: University of California, Santa Barbara
Candidate: Ljosa, Vebjorn
GTID: 1448390005462415
Subject: Biology
Abstract/Summary:
For data-driven research to become reality in microscopy-based biology, analysis algorithms must be able to run with minimal human input on large, diverse collections of images. Such algorithms have remained elusive, however, and qualitative observation by visual inspection dominates the research. Techniques for managing probabilistic data provide new hope. Analysis algorithms can produce probabilistic values, which indicate the confidence that can be placed in each part of the result. Higher-level analyses can exploit the extra information these values contain, and in turn produce their own probabilistic values. Probabilistic values are natural choices for representing aggregated data, and they have many applications besides biology, including moving-object databases, geographical information systems, and sensor networks.

In this dissertation, we demonstrate that it is possible to manage the uncertainty and to obtain, search, and mine probabilistic values. Our segmentation algorithm, which represents the uncertain extent of an object as a probabilistic mask, enables several analyses of the morphology of horizontal cells.

We propose several database techniques for indexing and searching probabilistic values. We present adaptive, piecewise-linear approximations (APLAs), which represent arbitrary probability distributions compactly with guaranteed quality, and an index structure called the APLA-tree. APLA is more precise than previous approximation techniques, so the APLA-tree can answer probabilistic range queries twice as fast. A novel definition of k-NN queries on uncertain data allows APLA and the APLA-tree to answer them efficiently, even on arbitrary probability distributions, for which no efficient k-NN search was previously possible.

We present the first algorithms for probabilistic spatial join (PSJ) queries, which rank their results according to the probabilities of the points and the distances between them. Our plane sweep algorithm exploits the special geometrical structure of the problem and runs in O(n (log n + k)) time, where n is the number of points and k is the number of results. Scheduling the join at the level of blocks of points improves the performance further. Experiments demonstrate speed-ups of two orders of magnitude.

Together, our techniques make viable the use of probabilistic values in image database systems for large-scale analysis and mining.
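
To make the idea of a probabilistic range query concrete, here is a minimal Python sketch. It assumes a one-dimensional uncertain value whose cumulative distribution is stored as a few straight-line segments between fixed breakpoints, and it answers the query by a plain linear scan. It is not the APLA construction or the APLA-tree from the dissertation, which choose the pieces adaptively and use an index. The names `PiecewiseLinearCDF`, `prob_in_range`, and `probabilistic_range_query` are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right


class PiecewiseLinearCDF:
    """Illustrative only: a CDF stored as straight segments between breakpoints."""

    def __init__(self, xs, cdf_values):
        # Assumption: xs is strictly increasing and cdf_values rises from 0 to 1.
        if len(xs) != len(cdf_values) or len(xs) < 2:
            raise ValueError("need at least two matching breakpoints")
        self.xs = list(xs)
        self.fs = list(cdf_values)

    def cdf(self, x):
        # Outside the stored support the CDF is flat.
        if x <= self.xs[0]:
            return self.fs[0]
        if x >= self.xs[-1]:
            return self.fs[-1]
        # Find the segment containing x and interpolate linearly within it.
        i = bisect_right(self.xs, x) - 1
        x0, x1 = self.xs[i], self.xs[i + 1]
        f0, f1 = self.fs[i], self.fs[i + 1]
        return f0 + (f1 - f0) * (x - x0) / (x1 - x0)

    def prob_in_range(self, lo, hi):
        # Probability that the uncertain value falls in [lo, hi].
        return max(0.0, self.cdf(hi) - self.cdf(lo))


def probabilistic_range_query(objects, lo, hi, threshold):
    """Return ids of objects that lie in [lo, hi] with probability >= threshold.

    Linear scan over all objects; a real system would prune with an index.
    """
    return [
        oid
        for oid, dist in objects.items()
        if dist.prob_in_range(lo, hi) >= threshold
    ]


# Hypothetical data: "a" is uniform on [0, 10]; "b" has most of its mass in [0, 2].
objects = {
    "a": PiecewiseLinearCDF([0, 10], [0.0, 1.0]),
    "b": PiecewiseLinearCDF([0, 2, 10], [0.0, 0.8, 1.0]),
}
matches = probabilistic_range_query(objects, lo=0, hi=2, threshold=0.5)
```

In this toy data set only object "b" qualifies, because it lies in [0, 2] with probability 0.8 while "a" does so with probability 0.2. A more compact approximation of each distribution means fewer segments to store and compare, which is the kind of saving the abstract attributes to APLA.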
Keywords/Search Tags: Probabilistic, Data, Algorithms, Techniques