Managing probabilistic data: Toward data-driven biology

Posted on:2008-07-07

Degree:Ph.D.

Type:Dissertation

University:University of California, Santa Barbara

Candidate:Ljosa, Vebjorn

GTID:1448390005462415

Subject:Biology

Abstract/Summary:

For data-driven research to become reality in microscopy-based biology, analysis algorithms must be able to run with minimal human input on large, diverse collections of images. Such algorithms have remained elusive, however, and qualitative observation by visual inspection dominates the research. Techniques for managing probabilistic data provide new hope. Analysis algorithms can produce probabilistic values, which indicate the confidence that can be placed in each part of the result. Higher-level analyses can exploit the extra information these values contain, and in turn produce their own probabilistic values. Probabilistic values are natural choices for representing aggregated data, and they have many applications besides biology, including moving-object databases, geographical information systems, and sensor networks.

In this dissertation, we demonstrate that it is possible to manage the uncertainty and to obtain, search, and mine probabilistic values. Our segmentation algorithm, which represents the uncertain extent of an object as a probabilistic mask, enables several analyses of the morphology of horizontal cells.

We propose several database techniques for indexing and searching probabilistic values. We present adaptive, piecewise-linear approximations (APLAs), which represent arbitrary probability distributions compactly with guaranteed quality, and an index structure called the APLA-tree. APLA is more precise than previous approximation techniques, so the APLA-tree can answer probabilistic range queries twice as fast. A novel definition of k-NN queries on uncertain data allows APLA and the APLA-tree to answer them efficiently, even on arbitrary probability distributions, for which no efficient k-NN search was previously possible.

We present the first algorithms for probabilistic spatial join (PSJ) queries, which rank their results according to the probabilities of the points and the distances between them. Our plane sweep algorithm exploits the special geometrical structure of the problem and runs in O(n (log n + k)) time, where n is the number of points and k is the number of results. Scheduling the join at the level of blocks of points improves the performance further. Experiments demonstrate speed-ups of two orders of magnitude.

Together, our techniques make viable the use of probabilistic values in image database systems for large-scale analysis and mining.
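
As a rough illustration of the probabilistic range queries mentioned in the abstract, the sketch below stores each uncertain value as a piecewise-linear cumulative distribution function and returns the objects whose probability of falling in a query interval meets a threshold. It is not the dissertation's APLA or APLA-tree: the breakpoint representation, the threshold semantics (`tau`), the function names, and the linear scan over all objects are assumptions made for the example.

```python
from bisect import bisect_right

def cdf_at(breakpoints, x):
    """Evaluate a piecewise-linear CDF given as sorted (x, F(x)) breakpoints.

    Assumption: x-coordinates are sorted in increasing order and F is
    non-decreasing.
    """
    xs = [bx for bx, _ in breakpoints]
    if x <= xs[0]:
        return breakpoints[0][1]
    if x >= xs[-1]:
        return breakpoints[-1][1]
    i = bisect_right(xs, x)  # index of the first breakpoint strictly right of x
    (x0, f0), (x1, f1) = breakpoints[i - 1], breakpoints[i]
    # Linear interpolation between the two neighbouring breakpoints.
    return f0 + (f1 - f0) * (x - x0) / (x1 - x0)

def range_probability(breakpoints, lo, hi):
    """Probability that the uncertain value lies in [lo, hi]."""
    return cdf_at(breakpoints, hi) - cdf_at(breakpoints, lo)

def probabilistic_range_query(objects, lo, hi, tau):
    """Names of objects whose probability of lying in [lo, hi] is at least tau.

    Hypothetical semantics: a plain linear scan with a probability threshold,
    with no index structure.
    """
    return sorted(
        name
        for name, breakpoints in objects.items()
        if range_probability(breakpoints, lo, hi) >= tau
    )

# Hypothetical data: three uncertain one-dimensional values.
objects = {
    "a": [(0.0, 0.0), (10.0, 1.0)],              # uniform on [0, 10]
    "b": [(4.0, 0.0), (6.0, 1.0)],               # uniform on [4, 6]
    "c": [(0.0, 0.0), (2.0, 0.5), (10.0, 1.0)],  # half the mass below 2
}
result = probabilistic_range_query(objects, 2.0, 5.0, 0.25)
```

An index would aim to prune most objects without evaluating their distributions; the scan above only shows what a single query is assumed to compute.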

Keywords/Search Tags:

Probabilistic, Data, Algorithms, Techniques

Related items

1	Probabilistic Based Classification Techniques for Improved Prognostics Using Time Series Data
2	An Improved Probabilistic Database Model And Its Probabilistic Nearest Neighbors Query Research
3	Probabilistic techniques for biological data analysis
4	Multi-dimensional Probabilistic Regression Over Imprecise Data Streams
5	Semi-supervised clustering: Probabilistic models, algorithms and experiments
6	Algorithms Of Probabilistic Frequent Itemsets From Uncertain Data
7	Probabilistic Graphical Models Based On Data Cleaning
8	The Investigation Of Multitarget Data Association Algorithms
9	Theory and algorithms for modern problems in machine learning and an analysis of markets
10	Fuzzy and probabilistic techniques applied to problems of the chemical process industries