
Towards improving performance and reliability of Cloud platforms

Posted on: 2014-12-27
Degree: Ph.D
Type: Thesis
University: The Pennsylvania State University
Candidate: Sharma, Bikash
Full Text: PDF
GTID: 2458390005498754
Subject: Engineering
Abstract/Summary:
Cloud computing refers both to applications delivered as services over the Internet and to the hardware and system software in the data centers that provide these services. It has emerged as one of the most versatile forms of utility computing, in which both applications and infrastructure can be leased from a seemingly infinite pool of computing resources under fine-grained, pay-as-you-go billing. Recent years have seen unprecedented growth in cloud services from key providers such as Amazon, Google, IBM and Microsoft. The characteristics that make cloud computing attractive and distinguish it from traditional distributed systems such as data centers and grids include self-organization, elasticity, multi-tenancy, a seemingly infinite resource pool and flexible economies of scale.

With the promises and benefits of cloud computing come associated challenges. Among these, performance and reliability issues are critical and form the main focus of this dissertation. Specifically, the dissertation addresses four performance and reliability concerns in clouds: (i) the lack of representative cloud workloads and performance benchmarks, which are essential for evaluating the characteristics of cloud systems; (ii) poor management of resources in big data cloud clusters running representative large-scale data processing applications such as Hadoop MapReduce, which stems from the static, fixed-size, coarse-grained, and uncoordinated resource allocation in the Hadoop framework; (iii) inefficient scheduling and resource management for a representative workload mix of interactive and batch applications running on hybrid data centers, which consist of both native and virtual machines; and (iv) the lack of an effective end-to-end problem determination and diagnosis framework for virtualized cloud platforms, which is essential for enhancing the reliability of the cloud infrastructure and hosted services.

To this end, the dissertation addresses these performance and reliability problems, explores the underlying motivations, proposes methodologies and solutions, evaluates them extensively through experimental and empirical analyses, and lays foundations for future research directions. The first chapter focuses on the characterization and modeling of cloud workloads. In particular, it models and synthesizes an important workload property called task placement constraints, and demonstrates their significant performance impact on scheduling in terms of the task pending delays incurred. The second chapter describes an efficient dynamic resource management framework, called MROrchestrator, which alleviates the downsides of slot-based resource allocation in Hadoop MapReduce clusters. MROrchestrator monitors and analyzes the execution-time resource footprints of the constituent map and reduce tasks and constructs run-time performance models of tasks as a function of their resource allocations, thereby improving application performance and cluster resource utilization. The third chapter proposes HybridMR, a hierarchical MapReduce scheduler for hybrid data centers. HybridMR operates in two hierarchical phases. The first phase guides the placement of MapReduce jobs on native or virtual machines, based on the expected virtualization overheads. The second phase manages the run-time resource allocations of interactive applications and collocated batch MapReduce jobs, with the objective of providing best-effort delivery to the MapReduce jobs while complying with the Service Level Agreements (SLAs) of the interactive services. Finally, the fourth chapter addresses the reliability of clouds in the context of efficient problem determination and diagnosis in virtualized cloud platforms, through a novel fault management framework called CloudPD. CloudPD adopts a 3-level hierarchical architecture, consisting of a light-weight event generation phase, a comprehensive problem determination phase, and an expert-knowledge-driven problem diagnosis phase, to provide accurate and fast localization of various anomalies in clouds.

Overall, the dissertation maintains that performance and reliability are critical aspects of cloud computing environments and need to be tackled through novel research and methodical evaluation.
Keywords/Search Tags: Cloud, Performance, Applications, Services, Resource, Data centers