Novel Abstractions for Data Center Network Management

Posted on: 2017-03-15
Degree: Ph.D.
Type: Thesis
University: The University of Wisconsin - Madison
Candidate: Gember-Jacobson, Aaron
Full Text: PDF
GTID: 2448390005478329
Subject: Computer Science
Abstract/Summary:
Data center failures have become increasingly problematic due to the plethora of critical web and storage services hosted in today's data centers. Frequently, the problem lies in the data center network, which is prone to both functional and performance failures caused by hardware or software faults, misconfiguration, overload, or other issues with links and devices.

Preventing such failures is challenging, because data center network operators lack a formal understanding of how their design and operational decisions impact the frequency of network problems. Furthermore, current frameworks for verifying and maintaining the functionality and performance of data center networks are incomplete and/or inefficient. Consequently, this thesis explores how to analyze an organization's network management practices and efficiently guarantee that a data center network functions correctly and offers reasonable performance amidst changes in infrastructure, configuration, and workload.

We first present the design of a management plane analytics (MPA) framework which uncovers the relationships between network management practices and the frequency of network problems. By applying MPA to over 850 data center networks operated by a large online service provider, we identify several practices that strongly impact the frequency of problems in these networks, including: the number of control plane configuration changes and the number of device types (i.e., the presence of middleboxes).

Armed with this information, we explore how to design abstractions that aid in ensuring the correct and performant operation of a data center's control plane and middleboxes. We introduce an abstract representation for control planes that efficiently models a data center network's forwarding behavior under all possible link/device failure scenarios. This allows us to verify important functional invariants (e.g., traffic between subnets S1 and S2 always traverses a middlebox) three to five orders of magnitude faster than current verification tools. Additionally, we introduce a middlebox state management framework that allows network operators to realize a "one-big-middlebox" abstraction and avoid middlebox-induced functional and performance failures in the presence of hardware/software faults or overload. Our framework guarantees the safety and consistency of transferred/replicated middlebox state with minimal latency and resource overhead.
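
As an illustration of the kind of invariant mentioned above (traffic between S1 and S2 always traverses a middlebox), the following is a minimal sketch, not the thesis's abstraction or algorithm. It assumes a purely topological model: the network is an undirected adjacency map, failures only remove links or devices, and routing policy is ignored. Under those assumptions, if no S1-to-S2 path avoids the middlebox in the full topology, then no failure scenario can create one, so a single reachability check with the middlebox removed suffices. All names and topologies here (`reachable`, `always_traverses`, `with_bypass`, `without_bypass`) are hypothetical.

```python
from collections import deque


def reachable(adj, src, dst, removed=frozenset()):
    """Breadth-first search from src to dst, skipping nodes in `removed`."""
    if src in removed or dst in removed:
        return False
    seen = {src}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in adj.get(node, ()):
            if nxt not in removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def always_traverses(adj, src, dst, waypoints):
    """True if every src-to-dst path in the full topology hits a waypoint.

    Assumption: failures only remove links/devices, so a path that avoids
    the waypoints after a failure would also exist in the full topology.
    """
    return not reachable(adj, src, dst, removed=frozenset(waypoints))


# Hypothetical topology in which router r2 offers a path around middlebox mb.
with_bypass = {
    "S1": ["r1"],
    "r1": ["S1", "mb", "r2"],
    "mb": ["r1", "r3"],
    "r2": ["r1", "r3"],
    "r3": ["mb", "r2", "S2"],
    "S2": ["r3"],
}

# Same topology without r2: the middlebox is the only way through.
without_bypass = {
    "S1": ["r1"],
    "r1": ["S1", "mb"],
    "mb": ["r1", "r3"],
    "r3": ["mb", "S2"],
    "S2": ["r3"],
}

print(always_traverses(with_bypass, "S1", "S2", {"mb"}))
print(always_traverses(without_bypass, "S1", "S2", {"mb"}))
```

This check is conservative: a real control plane may never use a link that exists physically, so a topology that fails the check can still satisfy the invariant once routing configuration is taken into account.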
Keywords/Search Tags: Data center, Management, Failures