Studying Recommender Systems to Enhance Distributed Computing Schedulers

Posted on:2017-05-16

Degree:M.S.

Type:Thesis

University:Duke University

Candidate:Demoulin, Henri Maxime

Full Text:PDF

GTID:2448390005976239

Subject:Computer Science

Abstract/Summary:

Distributed computing frameworks belong to a class of programming models that allow developers to launch workloads on large clusters of machines. Due to the dramatic increase in the volume of data gathered by ubiquitous computing devices, data analytics workloads have become a common case among distributed computing applications, making data science a field of computer science in its own right. We argue that a data scientist's concerns lie in three main components: a dataset, a sequence of operations they wish to apply to this dataset, and any constraints related to their work (performance, QoS, budget, etc.). However, it is extremely difficult to perform data science without domain expertise. One needs to select the right amount and type of resources, pick a framework, and configure it. Moreover, users often run their applications in shared environments governed by schedulers that expect them to specify their resource needs precisely. Because of the distributed and concurrent nature of these frameworks, monitoring and profiling are hard, high-dimensional problems that prevent users from making the right configuration choices and determining the amount of resources they need. Paradoxically, the system gathers a large amount of monitoring data at runtime, which remains unused.

In the ideal abstraction we envision for data scientists, the system is adaptive, able to exploit monitoring data to learn about workloads, and able to translate user requests into a tailored execution context. In this work, we study different techniques that have been used to make steps toward such system awareness, and explore a new way to do so by implementing machine learning techniques to recommend a specific subset of system configurations for Apache Spark applications. Furthermore, we present an in-depth study of Apache Spark executor configuration, which highlights the complexity of choosing the best one for a given workload.
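To give a sense of why executor configuration is hard to choose by hand, the following is a minimal sketch, not the thesis's method, that enumerates a few executor layouts for a hypothetical homogeneous cluster. The function name, the cluster sizes, and the even memory split are assumptions made for illustration; only the property names `spark.executor.instances`, `spark.executor.cores`, and `spark.executor.memory` are real Spark settings. The sketch ignores memory overhead and resources reserved for the operating system or cluster manager.

```python
def candidate_executor_configs(nodes, cores_per_node, mem_gb_per_node,
                               core_options=(1, 2, 4, 8)):
    """Enumerate executor layouts that fit a hypothetical homogeneous cluster.

    Assumption: each node is filled with identical executors and its memory
    is split evenly among them (no overhead or OS reservation is modeled).
    """
    configs = []
    for cores in core_options:
        if cores > cores_per_node:
            continue  # an executor cannot use more cores than a node has
        executors_per_node = cores_per_node // cores
        mem_gb = mem_gb_per_node // executors_per_node
        if mem_gb < 1:
            continue  # skip layouts that leave less than 1 GB per executor
        configs.append({
            "spark.executor.instances": str(nodes * executors_per_node),
            "spark.executor.cores": str(cores),
            "spark.executor.memory": f"{mem_gb}g",
        })
    return configs


# Hypothetical cluster: 4 nodes, 8 cores and 32 GB of memory each.
configs = candidate_executor_configs(nodes=4, cores_per_node=8, mem_gb_per_node=32)
for conf in configs:
    print(conf)
```

Even this small cluster admits several layouts, from many single-core executors to one large executor per node, and which of them performs best depends on the workload. Narrowing such a candidate set is the kind of choice a recommender would have to make.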

Keywords/Search Tags:

Computing, Distributed, System, Data

Related items

1	Agent-Oriented Intelligent Distributed Computing And Its Applications
2	Research Of Several Key Techniques On Distributed Data Processing
3	Antnest: A Distributed Computing System Supporting Multiple Computational Modals
4	Studying Recommender Systems to Enhance Distributed Computing Schedulers
5	A System For Distributed MD Data Analysis Based On Spark
6	Research On Optimization Of Map Reduce For Interactive Analysis On Big Data
7	GPU Computing In Massive Data Processing
8	Design And Implementation Of Data Visualization System For Distributed Offline Computing Platform
9	Image compression and data replication in distributed computing systems
10	Design And Implementation Of Distributed Graph Computing Engine