Big Data Analytics: Methods and Application

Posted on:2019-02-12

Degree:Ph.D

Type:Dissertation

University:The University of Wisconsin - Madison

Candidate:Paulson, Erik Steven

GTID:1478390017488958

Subject:Computer Science

Abstract/Summary:

Big Data is now pervasive. This has driven a critical need to develop novel methods to store and process data at large scale, as well as new applications that use and make sense of this data. This dissertation makes two contributions toward addressing this need. First, we study methods for large-scale data analysis. In particular, we compare the popular MapReduce (MR) model to parallel relational database management systems (DBMSs) and empirically analyze their strengths and weaknesses. We evaluate both kinds of systems in terms of performance and development complexity. To this end, we define a collection of benchmarks that we have run on an open-source version of MR as well as on two parallel DBMSs. For each benchmark, we measure each system's performance for various degrees of parallelism on a cluster of 100 shared-nothing nodes. Our results reveal some interesting trade-offs. We speculate about the causes of the dramatic performance difference and consider implementation concepts that future systems should take from both kinds of architectures.

In the second contribution, we examine how Big Data scaling methods can be used to build a scalable and flexible cloud-based entity matching application, and what lessons can be learned for the future development of similar applications. Entity matching (EM) finds disparate data instances that refer to the same real-world entity. EM has long been studied and is crucial to many fields, and it will become even more so in the age of Big Data. However, it is still very difficult for domain scientists to use EM systems, especially at scale. In response, we have developed CloudMatcher, a cloud/crowd service for EM. CloudMatcher aims to be a fast, easy-to-use, scalable, and highly available EM service on the Web. As far as we can tell, no such application has been developed for EM in the data management research community. We describe CloudMatcher's development and deployment, providing a detailed analysis of its performance over several representative datasets and in several scale-up experiments, and discussing lessons learned. Taken together, the contributions of this dissertation advance Big Data analytics in both methods and applications.
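
As a rough illustration of what entity matching means, the following minimal Python sketch flags pairs of records whose word tokens overlap strongly. It is not CloudMatcher's method, which the abstract does not describe. The function names (`tokens`, `jaccard`, `match_entities`), the Jaccard similarity measure, the 0.5 threshold, and the sample records are all assumptions chosen for illustration.

```python
from itertools import combinations


def tokens(record):
    """Lowercase a record and split it into a set of word tokens."""
    cleaned = record.lower().replace(",", " ").replace(".", " ")
    return frozenset(cleaned.split())


def jaccard(a, b):
    """Jaccard similarity of two token sets: |a & b| / |a | b|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def match_entities(records, threshold=0.5):
    """Return index pairs (i, j), i < j, whose similarity meets the threshold.

    This compares every pair of records, so it is quadratic in the number
    of records; it is a toy illustration, not a scalable design.
    """
    token_sets = [tokens(r) for r in records]
    return [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if jaccard(token_sets[i], token_sets[j]) >= threshold
    ]


# Hypothetical sample records; the first two refer to the same institution.
records = [
    "Univ. of Wisconsin Madison",
    "University of Wisconsin, Madison",
    "Stanford University",
]

if __name__ == "__main__":
    print(match_entities(records))
```

The all-pairs comparison in this sketch grows quadratically with the number of records, which hints at why performing EM at scale is hard.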

Keywords/Search Tags:

Data, Methods, Applications

Related items

1	Data science for imbalanced data: Methods and applications
2	Big Data Analytics: Methods and Application
3	Sampling-based Bayesian latent variable regression methods with applications in process engineering
4	Kernel-based empirical Bayesian classification methods with applications to protein phosphorylation and non-coding RNA
5	On advancing MCMC-based methods for Markovian data structures with applications to deep learning, simulation, and resamplin
6	Efficient moments-based permutation tests: A framework, methods and applications
7	Scalable kernel methods for machine learning
8	Receding horizon methods in cooperative control for stochastic transportation applications on graphs
9	Methods And Applications Of Data Processing Focusing On Tracking
10	Based On Soa Data Service Methods And Applications