
Toward software-defined Big Data Processing Systems

Posted on: 2016-11-20
Degree: Ph.D
Type: Dissertation
University: University of Florida
Candidate: Yu, Ze
Full Text: PDF
GTID: 1478390017981136
Subject: Computer Engineering
Abstract/Summary:
During the last decade we have witnessed an explosive increase in the size of data. Transforming massive datasets into useful information is a critical but difficult task for data scientists. To make big data processing more efficient and easier, a wide range of Big Data Processing Systems (BDPSs) have been developed. These systems consume a large portion of the resources in the datacenter. As a result, efficient resource management and task scheduling are important, both to BDPSs and to the entire datacenter.

A BDPS relies on close interaction with the underlying infrastructure to manage resources efficiently. However, traditional datacenter infrastructure is ossified and opaque, making such interaction difficult to implement. Lacking a common interface to query information about the infrastructure, it is hard for a BDPS to efficiently map tasks to appropriate resources; lacking a unified API to control the resources, BDPSs have to implement all kinds of workarounds to approximate the required control.

Fortunately, software-defined infrastructure opens the black box of the datacenter infrastructure and enables efficient and flexible interaction between BDPSs and the infrastructure. In this dissertation, I argue that it is beneficial to incorporate two key design principles of software-defined infrastructure into BDPSs: (1) leveraging the dynamic and global view of the software-defined infrastructure to provide an information query API to BDPSs, and (2) leveraging the programmability of the software-defined infrastructure to exert direct and dynamic control over the resources, according to each BDPS's requirements.

As a proof of concept, I first apply Software-Defined Networking (SDN) to MapReduce by automatically and dynamically exposing network locality information to MapReduce. Based on this, I have implemented several key optimizations that are difficult, if not impossible, to implement without SDN. Furthermore, I propose a Software-Defined Computing abstraction called the resource balloon for dynamic and fine-grained resource sharing. Evaluation shows the promising advantages that can be obtained from this novel incorporation.

To further illustrate how BDPSs can benefit from software-defined infrastructure, I implement an application-controlled cache/prefetch architecture that leverages this infrastructure in a coordinated way to apply application-aware cache policies, addressing the non-local straggler problem that arises in data-parallel computation systems such as MapReduce. Extensive experiments show that applications can benefit significantly from this application awareness.
Keywords/Search Tags: Data, Software-defined, Systems, BDPS