Research & Implementation Of Large-Scale Resource Management Technology For High Productivity Computing

Posted on: 2010-01-20
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y T Lu
GTID: 1118360278456534
Subject: Computer Science and Technology
Abstract/Summary:
The technology trend in supercomputing is shifting from the pure pursuit of peak performance to the comprehensive pursuit of high productivity. A high productivity computing system (HPCS) aims to improve the programmability, portability, and robustness of the system and to reduce development, operation, and maintenance costs. However, because of their very large scale and their complex, heterogeneous architectures, next-generation teraflops and petaflops systems face vital challenges in meeting the high productivity target. Specifically, these challenges include how to improve sustained performance, reliability, scalability, and flexibility, and how to significantly reduce power consumption across the overall design. These challenges have become critical research issues for the large-scale resource management system (RMS) of an HPCS.

Our research is based on the implementation of the large-scale resource management system for our own high performance computer system, which has the Scalable Shared Memory Processing (S2MP) architecture. Focusing on the development of a high productivity resource management system for large-scale parallel systems, this thesis systematically investigates key techniques in efficient resource modeling, scalable RMS architecture, optimized scheduling policy, fault-tolerant job management, power management, and related areas. The main contributions of this thesis are as follows:

1. A deep resource information model (DRIM) for large-scale parallel computing systems has been proposed. DRIM addresses the coarse-grained resource definitions of traditional resource management systems and provides more comprehensive and realistic resource objects. Specifically, DRIM establishes an entity model, a function model, and an application model, which can accurately characterize computing resources, communication resources, storage resources, and different types of applications. DRIM also abstracts the relationships between resources to make the management policy more effective and the management capability more viable. In short, DRIM provides strong support for job scheduling and resource allocation in the RMS.

2. A dynamic cascade resource management architecture has been proposed, which creates the cascade services dynamically in a self-organizing mode. A lightweight, optimized transport protocol has been designed to reduce management overhead and improve the communication performance of control messages. A fast job-launching mechanism has been presented, which uses low-level hardware communication mechanisms and collective operations. Together these improve the scalability of the RMS. A component-based system architecture has been used to support the functional scalability of the RMS. MCRM, the Multiple Case Resource Management system, has been implemented for the system with the S2MP architecture. Experiments on an S2MP system with 2048 processors show that MCRM achieves better scalability.

3. An integrated-priority scheduling policy has been proposed. It considers factors drawn from the job attributes, resource attributes, and service attributes of the system, which improves the flexibility and efficiency of the scheduling mechanism. The MC-backfill scheduling policy has been designed, which adjusts the backfill depth and frequency according to the status of the job queue. MC-backfill not only improves system throughput but also takes system fairness into account. The experimental results show that with the MC-backfill policy, even when users estimate job running times inaccurately, the average waiting time of jobs decreases and system throughput improves.

4. A model of fault-tolerant job running time under the checkpoint/restart technique, based on the Weibull failure distribution model for high performance computing systems, has been proposed. Algorithms for calculating the best checkpoint interval and for selecting the best collection of processors have been designed to increase the reliability of the system. An automatic job recovery mechanism has been implemented for the S2MP system. With checkpoints, jobs can recover automatically when a system failure occurs. This method avoids manual intervention, reduces the average fault recovery time, and increases system availability.

5. Two approaches to power management have been proposed for the large-scale RMS. As the system-level approach, an algorithm has been presented for scheduling jobs and allocating resources under a constraint on system energy consumption. As the application-level approach, a Feedback based Two-Level Power Management (FTLPM) model has been presented, which reduces redundant parallelism in applications to decrease energy consumption. FTLPM combines a linear control model and a fuzzy control model to control the concurrency of threads and processes according to the memory bandwidth of the multi-core processor and the I/O bandwidth of the file system. The experimental results show the effectiveness of these approaches.
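
To make the backfill idea in contribution 3 concrete, the following is a minimal sketch of a conventional EASY-style backfill pass with a depth limit. It is not the thesis's MC-backfill policy, whose rules for adapting depth and frequency are not given in this abstract. The names `Job`, `backfill_pass`, and the `depth` parameter are hypothetical, and run times are taken to be user estimates.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    procs: int     # processors requested
    estimate: int  # user-estimated run time


def backfill_pass(queue, free, running, now, depth):
    """Return names of jobs to start at time `now`.

    queue   -- waiting jobs in priority order
    free    -- processors idle right now
    running -- list of (estimated_end_time, procs) for running jobs
    depth   -- how many jobs behind the queue head may be considered
    """
    queue = list(queue)
    running = list(running)
    started = []

    # Start jobs in order for as long as the queue head fits.
    while queue and queue[0].procs <= free:
        job = queue.pop(0)
        free -= job.procs
        running.append((now + job.estimate, job.procs))
        started.append(job.name)
    if not queue:
        return started

    # Reserve for the blocked head: find the earliest time ("shadow time")
    # at which enough processors will be free, and how many are left over.
    head = queue[0]
    avail, shadow = free, None
    for end, procs in sorted(running):
        avail += procs
        if avail >= head.procs:
            shadow = end
            break
    if shadow is None:  # head asks for more than the machine can ever offer
        return started
    extra = avail - head.procs

    # Backfill: a later job may start only if it does not delay the head.
    for job in queue[1:1 + depth]:
        if job.procs > free:
            continue
        if now + job.estimate <= shadow:
            free -= job.procs          # finishes before the reservation
            started.append(job.name)
        elif job.procs <= extra:
            free -= job.procs          # runs past it, but on spare processors
            extra -= job.procs
            started.append(job.name)
    return started


waiting = [Job("A", 8, 5), Job("B", 4, 20), Job("C", 2, 8), Job("D", 2, 30)]
print(backfill_pass(waiting, free=4, running=[(10, 6)], now=0, depth=3))
print(backfill_pass(waiting, free=4, running=[(10, 6)], now=0, depth=1))
```

With a depth of 3 the pass starts C and D behind the blocked job A, while a depth of 1 only examines B and starts nothing. A policy that tunes the depth from the queue state, as the abstract describes for MC-backfill, would adjust this parameter between passes.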
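
The checkpoint-interval calculation in contribution 4 can be illustrated with a rough stand-in. The abstract does not give the thesis's own model, so this sketch swaps in Young's first-order approximation, which strictly assumes exponential failures, and feeds it the mean of a Weibull distribution as the MTBF. It also assumes that node failures are independent and identically distributed, so the time to the first failure among n nodes is again Weibull with a smaller scale. All function and parameter names are hypothetical.

```python
import math


def weibull_mean(shape: float, scale: float) -> float:
    """Mean of a Weibull(shape, scale) distribution: scale * Gamma(1 + 1/shape)."""
    return scale * math.gamma(1.0 + 1.0 / shape)


def system_scale(node_scale: float, shape: float, nodes: int) -> float:
    """Scale of the time to first failure among `nodes` i.i.d. Weibull nodes.

    The minimum of n i.i.d. Weibull(shape, scale) variables is
    Weibull(shape, scale * n ** (-1 / shape)).
    """
    return node_scale * nodes ** (-1.0 / shape)


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's approximation of the checkpoint interval: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def checkpoint_interval(node_scale, shape, nodes, checkpoint_cost):
    """Approximate interval for a job spanning `nodes` nodes (same time unit as inputs)."""
    mtbf = weibull_mean(shape, system_scale(node_scale, shape, nodes))
    return young_interval(checkpoint_cost, mtbf)


# Illustrative inputs only, not values from the thesis.
for n in (64, 512, 2048):
    interval = checkpoint_interval(node_scale=5000.0, shape=0.7, nodes=n, checkpoint_cost=0.1)
    print(f"{n:5d} nodes -> checkpoint roughly every {interval:.2f} hours")
```

Under these assumptions the interval shrinks as the job spans more nodes, which is one reason the choice of processors and the checkpoint interval interact. A shape parameter below 1 models a failure rate that decreases over time, where the exponential-based approximation is least accurate, so a Weibull-specific model such as the one the thesis proposes would be preferable.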
Keywords/Search Tags: High Productivity Computing, Resource Management System (RMS), Scalability, Reliability, Power Management