Big Data Analytics: Methods and Application

Posted on: 2019-02-12
Degree: Ph.D
Type: Dissertation
University: The University of Wisconsin - Madison
Candidate: Paulson, Erik Steven
Full Text: PDF
GTID: 1478390017488958
Subject: Computer Science
Abstract/Summary:
Big Data is now pervasive. This has driven a critical need to develop novel methods to store and process data at large scale, as well as new applications to use and make sense of this data. This dissertation makes two contributions toward addressing this need. First, we study methods for large-scale data analysis. In particular, we compare the popular MapReduce (MR) model to parallel relational database management systems (DBMSs), and empirically analyze their strengths and weaknesses. We evaluate both kinds of systems in terms of performance and development complexity. To this end, we define a collection of benchmarks that we have run on an open-source version of MR as well as on two parallel DBMSs. For each benchmark, we measure each system's performance for various degrees of parallelism on a cluster of 100 shared-nothing nodes. Our results reveal some interesting trade-offs. We speculate about the causes of the dramatic performance differences and consider implementation concepts that future systems should take from both kinds of architectures.

In the second contribution, we examine how Big Data scaling methods can be used to build scalable and flexible cloud-based entity matching applications, and what lessons can be learned for future development of similar applications. Entity matching (EM) finds disparate data instances that refer to the same real-world entity. EM has long been studied and is crucial to many fields, and will become even more so in the age of Big Data. However, it is still very difficult for domain scientists to use EM systems, especially at scale. In response, we have developed CloudMatcher, a cloud/crowd service for EM. CloudMatcher aims to be a fast, easy-to-use, scalable, and highly available EM service on the Web. As far as we can tell, no such application has been developed for EM in the data management research community. We describe CloudMatcher's development and deployment, providing a detailed analysis of its performance over several representative datasets and in several scale-up experiments, and discussing lessons learned. Taken together, our contributions in this dissertation advance the topic of Big Data analytics in both methods and applications.
Keywords/Search Tags: Data, Methods, Applications