With the development of information technology and the rapid spread of the Internet, people are shifting from being receivers of information to being its producers. With this explosive growth of information, the Internet has entered the era of big data. Big data has become an important strategic resource and a new basis for decision-making, and cloud computing provides the computing and storage capacity needed to analyze and process it. Hadoop is an open-source distributed computing platform from the Apache Software Foundation. It can be deployed on ordinary commodity computers and offers high fault tolerance at low cost. With the rise of big data and cloud computing, an increasing number of companies have started using Hadoop to provide cloud services. The ever-growing, large-scale deployment of high-performance computing clusters consumes a huge amount of energy. Cloud service providers must not only meet users' needs as specified in service-level agreements (SLAs), but also reduce resource costs as much as possible while ensuring quality of service.

To overcome the limitations of Hadoop 1.0 in reliability, scalability, and resource utilization, Hadoop 2.0 abstracts its resource management functions into a general system called YARN. YARN supports a variety of computing frameworks (e.g., MapReduce, Spark, Storm) and provides unified management of cluster resources, with the advantages of high resource utilization, low operation and maintenance costs, and data sharing. As the most popular cluster resource management system, YARN faces two severe challenges as well as new development opportunities: one is automatically tailoring and controlling the resources allocated to different jobs so that they meet the deadlines specified in their SLAs; the other is minimizing the energy consumption of the entire cluster while maintaining deadline control. Therefore, resource allocation and energy-efficient scheduling for Hadoop YARN in cloud computing environments has become a problem that urgently needs to be studied and resolved.

To address these problems and challenges, this paper proposes an SLA-aware energy-efficient scheduling strategy for Hadoop YARN. First, the strategy performs job profiling to obtain the performance characteristics of the different phases of a MapReduce application in multi-tenant cloud computing environments; these characteristics are used to determine the two-stage task parallelism needed to meet the completion deadline specified by the application’s SLA. Second, it uses an SLA-aware resource scheduler to allocate resources dynamically to each application, ensuring that the task parallelism does not change at runtime. Finally, it integrates dynamic voltage and frequency scaling (DVFS) to save energy by exploiting slack time when running tasks. In sum, the proposed energy-efficient scheduling strategy allocates resources reasonably in multi-tenant cloud computing environments and minimizes the energy consumption of cloud computing platforms while meeting deadlines.

Network bandwidth has always been one of the bottlenecks restricting the development of cloud computing. The SLA-aware energy-efficient scheduling strategy for Hadoop YARN therefore makes the most of data locality in Hadoop to reduce network traffic, and uses the slack time produced by data transmission to schedule tasks energy-efficiently. In addition, this paper uses CloudSim, a cloud computing simulation platform, to evaluate the performance of the resource allocation and energy-efficient scheduling algorithms. Experimental results show that, compared with the existing YARN resource scheduling scheme, the proposed energy-efficient scheduling strategy makes good use of hardware resources and achieves better SLA conformance with low resource cost and energy consumption.
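
As an illustration only, the two ideas summarized above (choosing per-stage task parallelism from profiled task durations, and using slack time to run a task at a lower frequency) might be sketched as follows. This is not the paper's algorithm: the function names, the wave-based execution model, and the assumption that runtime scales inversely with CPU frequency are all simplifications introduced here for the example.

```python
import math


def min_parallelism(num_tasks, avg_task_secs, stage_budget_secs):
    """Smallest number of concurrent tasks that finishes a stage in its budget.

    Assumption (hypothetical model): tasks run in equal-length "waves",
    each wave lasting avg_task_secs as measured by job profiling.
    """
    max_waves = math.floor(stage_budget_secs / avg_task_secs)
    if max_waves < 1:
        raise ValueError("stage budget is shorter than a single task")
    return math.ceil(num_tasks / max_waves)


def lowest_feasible_frequency(task_secs_at_fmax, slack_secs, freqs_ghz):
    """Lowest available frequency that still finishes within task time plus slack.

    Assumption (hypothetical model): runtime scales inversely with frequency.
    Falls back to the maximum frequency when no lower one fits.
    """
    f_max = max(freqs_ghz)
    allowed_secs = task_secs_at_fmax + slack_secs
    for f in sorted(freqs_ghz):
        if task_secs_at_fmax * f_max / f <= allowed_secs:
            return f
    return f_max


# Two-stage parallelism for a MapReduce job with made-up profile values.
map_slots = min_parallelism(num_tasks=100, avg_task_secs=30, stage_budget_secs=300)
reduce_slots = min_parallelism(num_tasks=20, avg_task_secs=60, stage_budget_secs=200)

# A 30 s task with 15 s of slack on a CPU offering four frequency levels.
chosen_ghz = lowest_feasible_frequency(30, 15, [1.0, 1.2, 1.6, 2.0])
```

Under these assumptions, a larger stage budget lowers the required parallelism, and more slack allows a lower frequency. A real scheduler would also have to account for task-duration variance, data locality, and the power model of the actual hardware.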