Font Size: a A A

Processing Theta-Joins on Shared-Nothing Systems

Posted on:2015-01-29Degree:Ph.DType:Dissertation
University:Northeastern UniversityCandidate:Okcan, AlperFull Text:PDF
GTID:1478390017998070Subject:Computer Science
Abstract/Summary:
Joins are essential for many large-scale data analysis tasks, and a variety of join conditions must be supported for many applications such as data-driven science, advertising, marketing, and social networks. Efficient parallel execution of joins is crucial to cope with the large volumes of data being collected and generated in many disciplines.;We explore how to efficiently process theta-joins in distributed shared-nothing systems. While equi-joins have been studied extensively, it was unclear how to efficiently distribute computation of arbitrary join conditions on a cluster of nodes. We show how to process theta-joins efficiently in parallel when the goal is to minimize response time. We propose a join model that simplifies creation of and reasoning about parallel join algorithms. Using this model, we introduce a randomized algorithm whose response time is provably within a small constant factor of the lower bound for a variety of join problems. For other popular classes of joins where this does not apply, we develop efficient heuristics.;The drawback of our distributed theta-join algorithms is a potentially high degree of input replication depending on the input data distribution, available input statistics, join condition, and cluster properties. We propose lightweight encoding and decoding strategies in order to reduce the amount of data transferred across the network. These strategies are also applicable to a large and diverse spectrum of applications executed using the MapReduce programming model. Data transfer reduction is achieved by dynamically and adaptively performing mapper-side tasks on the reducers.;Finally, we integrate our optimization techniques into Scolopax, a novel system that supports exploratory analysis for data-driven science. Scolopax can explore a huge space of possible hypotheses, returning a ranked list of those that best match the user preferences. Scolopax supports flexible join predicates used by scientists to compute relationships and correlations in high-dimensional scientific data.
Keywords/Search Tags:Join, Data
Related items