
Scalable Strategies for Computing with Massive Sets of Data

Posted on: 2011-12-09
Degree: Ph.D
Type: Dissertation
University: Yale University
Candidate: Kane, Michael J
Full Text: PDF
GTID: 1448390002460257
Subject: Statistics
Abstract/Summary:
The analysis of very large data sets has recently become an active area of research in statistics and machine learning. Many new computational challenges arise when managing, exploring, and analyzing these data sets, challenges that effectively put the data beyond the reach of researchers who lack specialized software development skills or expensive hardware. This dissertation presents new ways of meeting these challenges.

Most of this dissertation is devoted to the Bigmemory Project, a scalable, extensible framework for statistical computing. Currently, the Bigmemory Project is designed to extend the R programming environment through a set of packages (bigmemory, bigtabulate, biganalytics, synchronicity, and bigalgebra), but it could also be used as a standalone C++ library or with other languages and programming environments.

Using the Bigmemory Project as the vehicle, the dissertation proposes three new ways to work with very large sets of data: memory- and file-mapped data structures, which provide access to arbitrarily large sets of data while retaining a look and feel that is familiar to statisticians; data structures that are shared across processor cores on a single computer, supporting efficient parallel computing when multiple processors are used; and file-mapped data structures that allow concurrent access by the different nodes in a cluster of computers. Even though these three techniques are currently implemented only for R, they are intended to provide a flexible framework for future developments in the field of statistical computing.
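As a brief illustration of the first and third ideas (file-mapped data structures and concurrent access by separate processes), the sketch below uses the bigmemory functions filebacked.big.matrix() and attach.big.matrix(). The file names, dimensions, and working-directory layout are illustrative assumptions, and argument details may differ across package versions; this is a minimal sketch, not the dissertation's own code.

    ## R sketch: a file-backed big.matrix whose data are memory-mapped on disk,
    ## so the object can be far larger than available RAM.
    library(bigmemory)

    x <- filebacked.big.matrix(nrow = 1e6, ncol = 10, type = "double",
                               backingfile = "x.bin",        # on-disk data (assumed name)
                               descriptorfile = "x.desc")    # descriptor for re-attaching
    x[1:5, 1] <- rnorm(5)    # ordinary matrix-style subscripting

    ## A second R process -- another core, or a cluster node sharing the same
    ## file system -- attaches to the identical data via the descriptor file.
    y <- attach.big.matrix("x.desc")
    y[1:5, 1]                # sees the values written by the first process

Because both objects map the same backing file rather than copying it, no data are duplicated in memory, which is the property that makes the shared-memory and cluster-level techniques described above feasible.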
Keywords/Search Tags: Data, Sets, Computing