
New algorithms, data structures, and user interfaces for machine learning of large datasets with applications

Posted on: 2005-09-17
Degree: Ph.D.
Type: Thesis
University: Stanford University
Candidate: Moraleda, Jorge
Full Text: PDF
GTID: 2458390008483376
Subject: Computer Science
Abstract/Summary:
This thesis comprises new algorithms, data structures, and user interfaces for learning analytic models from large datasets. The theoretical contributions have been validated experimentally in several domains using the Bayesian Networks framework. Two topics are studied in detail: learning good analytic models from data, and providing a user-friendly interface for visualizing and interacting with these large models.

Learning from data is a hard problem. In particular, learning Bayesian Network structure from data is NP-hard. Thus, heuristic search is necessary to find good models. There are two approaches to improving heuristic search: speeding up model evaluation so that a larger number of models can be searched in a given time, and using better heuristics to generate higher-quality models early in the search. This research has addressed both issues through the development and experimental validation of the AD+Tree and Queue Learning.

The AD+Tree is a data structure that caches counts from the dataset very efficiently, enabling fast evaluation of larger models. Under certain assumptions, the AD+Tree allows one to process datasets one order of magnitude larger than those processed with other data structures of similar speed. These theoretical bounds have been validated experimentally. The AD+Tree's usefulness extends beyond Bayesian Networks to other analytic modeling paradigms based on discrete data.

Queue Learning is an algorithm for learning Bayesian Network structure that I have shown experimentally produces better models early in the search than existing techniques when applied to large datasets. Queue Learning is also an inherently parallel algorithm, and thus holds the potential for significant speed improvements when used in distributed systems.

Existing user interfaces do not scale well with model size. New user interfaces have been developed to address the challenges of displaying and interacting with larger models. In particular, the use of coloring schemes has increased the number of attributes that can be manipulated comfortably from a few tens to a few hundreds.

A chapter is devoted to presenting three case studies using real-world datasets. They exemplify the use of Bayesian Network automatic modeling and novel user interfaces in genomics, proteomics, and financial arbitrage.
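
The idea of caching counts to speed up model evaluation can be illustrated with a much simpler structure than the one the thesis describes. The sketch below is not the AD+Tree: it is a plain memoized table of joint counts, written in Python, and the names `CountCache` and `family_log_likelihood` are hypothetical. It assumes a small discrete dataset held in memory as tuples of values, and it uses a maximum-likelihood log score only as a stand-in for whatever scoring function a structure search would use.

```python
from collections import Counter
from math import log


class CountCache:
    """Memoizes joint value counts for subsets of discrete columns.

    Hypothetical helper for illustration; not the AD+Tree itself.
    """

    def __init__(self, rows):
        self.rows = rows      # list of equal-length tuples of discrete values
        self._cache = {}      # sorted column tuple -> Counter of value tuples
        self.scans = 0        # number of full passes made over the data

    def counts(self, columns):
        key = tuple(sorted(columns))
        if key not in self._cache:
            # Only an unseen column subset costs a pass over the dataset.
            self.scans += 1
            self._cache[key] = Counter(
                tuple(row[c] for c in key) for row in self.rows
            )
        return self._cache[key]


def family_log_likelihood(cache, child, parents):
    """Maximum-likelihood log score of `child` given `parents`."""
    parent_cols = tuple(sorted(parents))
    family_cols = tuple(sorted(parent_cols + (child,)))
    joint = cache.counts(family_cols)
    score = 0.0
    if parent_cols:
        parent_counts = cache.counts(parent_cols)
        # Positions of the parent columns inside each joint configuration.
        pos = [family_cols.index(c) for c in parent_cols]
        for config, n in joint.items():
            parent_config = tuple(config[i] for i in pos)
            score += n * log(n / parent_counts[parent_config])
    else:
        total = len(cache.rows)
        for n in joint.values():
            score += n * log(n / total)
    return score


rows = [(0, 0), (0, 0), (1, 1), (1, 1)]
cache = CountCache(rows)
with_parent = family_log_likelihood(cache, child=1, parents=[0])
without_parent = family_log_likelihood(cache, child=1, parents=[])
```

A heuristic search evaluates many candidate parent sets that share column subsets, so repeated requests for the same counts are served from the cache (tracked here by `scans`) instead of rescanning the data. A flat cache like this one grows quickly with the number of column subsets requested, which is the kind of cost a more compact count-caching structure is meant to control.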
Keywords/Search Tags:User interfaces, Data, New, Models, Bayesian network