
Scalable Strategies for Computing with Massive Sets of Data

Posted on: 2011-12-09
Degree: Ph.D
Type: Dissertation
University: Yale University
Candidate: Kane, Michael J
Full Text: PDF
GTID: 1448390002460257
Subject: Statistics
Abstract/Summary:
The analysis of very large data sets has recently become an active area of research in statistics and machine learning. Many new computational challenges arise when managing, exploring, and analyzing these data sets, challenges that effectively put the data beyond the reach of researchers who lack specialized software development skills or expensive hardware. This dissertation presents new ways of meeting these challenges.

Most of this dissertation is devoted to the Bigmemory Project, a scalable, extensible framework for statistical computing. Currently, the Bigmemory Project is designed to extend the R programming environment through a set of packages (bigmemory, bigtabulate, biganalytics, synchronicity, and bigalgebra), but it could also be used as a standalone C++ library or with other languages and programming environments.

Using the Bigmemory Project as the vehicle, the dissertation proposes three new ways to work with very large sets of data: memory- and file-mapped data structures, which provide access to arbitrarily large sets of data while retaining a look and feel that is familiar to statisticians; data structures that are shared across processor cores on a single computer, supporting efficient parallel computing when multiple processors are used; and file-mapped data structures that allow concurrent access by the different nodes in a cluster of computers. Even though these three techniques are currently implemented only for R, they are intended to provide a flexible framework for future developments in the field of statistical computing.
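As a brief illustration of the first and third ideas (file-mapped data structures and concurrent access by separate processes), the sketch below uses the bigmemory functions filebacked.big.matrix() and attach.big.matrix(). The file names, dimensions, and working-directory layout are illustrative assumptions, and argument details may differ across package versions; this is a minimal sketch, not the dissertation's own code.

    ## R sketch: a file-backed big.matrix whose data are memory-mapped on disk,
    ## so the object can be far larger than available RAM.
    library(bigmemory)

    x <- filebacked.big.matrix(nrow = 1e6, ncol = 10, type = "double",
                               backingfile = "x.bin",        # on-disk data (assumed name)
                               descriptorfile = "x.desc")    # descriptor for re-attaching
    x[1:5, 1] <- rnorm(5)    # ordinary matrix-style subscripting

    ## A second R process -- another core, or a cluster node sharing the same
    ## file system -- attaches to the identical data via the descriptor file.
    y <- attach.big.matrix("x.desc")
    y[1:5, 1]                # sees the values written by the first process

Because both objects map the same backing file rather than copying it, no data are duplicated in memory, which is the property that makes the shared-memory and cluster-level techniques described above feasible.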
Keywords/Search Tags: Data, Sets, Computing