Research On Accelerator-centric Programming Model And Optimizations For Heterogeneous Computing Systems

Posted on: 2018-01-19  Degree: Doctor  Type: Dissertation
Country: China  Candidate: C Chen  Full Text: PDF
GTID: 1368330623950329  Subject: Computer Science and Technology
Abstract/Summary:
Heterogeneous computing has become an active research area because of its advantages in computing capacity and power efficiency. Heterogeneous systems introduce special-purpose coprocessors, such as graphics processing units (GPUs) and Intel Many Integrated Core (MIC) coprocessors. The computing components therefore have different architectures and instruction sets, and the coprocessor's memory is separate from the host's main memory, which makes programming and performance optimization on heterogeneous systems difficult. Meanwhile, as the number of components in large-scale heterogeneous clusters grows, fault tolerance becomes essential to the reliability of the whole system. To address these issues, this thesis studies parallel programming, performance optimization, and fault tolerance techniques for heterogeneous systems.

The thesis presents a new programming model, the reverse offload (rOffload) programming model. Instead of using the accelerator for the compute-intensive parts, the program is launched from the coprocessor and its control-intensive parts are offloaded over PCIe to the CPU, so that the CPU acts as an accelerator for the irregular parts. Based on this model, we provide a programming interface and a prototype implementation on a CPU-MIC system, together with load balancing and task scheduling optimizations. Because reliability is a major concern, we design and implement a fault tolerance framework, based on in-memory checkpointing, for hybrid programs that run on heterogeneous hardware architectures. To verify the efficiency of our work, we propose HybridLinpack and implement it on Tianhe-2, a CPU-MIC heterogeneous system.

The contributions of this thesis can be summarized as follows:

1. The Intel MIC can act as an autonomous compute node with its own IP address. Building on this, we address programmability by developing the rOffload programming model for CPU-MIC systems. Instead of using the MIC for the compute-intensive parts of applications, we use the CPU as an accelerator for irregular code. We also develop a framework that offloads computations from the MIC to the host CPU through a unified API. To handle MPI communication, we provide an MPI proxy on the MIC that is completely transparent to programmers. Building on the Intel Manycore Platform Software Stack (MPSS), we propose a compiler-independent runtime framework that enables these offloading patterns. It uses both Intel's Symmetric Communications Interface (SCIF) and standard MPI as back ends for task offload and for the communication that offloading introduces.

2. To reduce idle periods (bubbles) in the MIC's execution when programming with rOffload, we propose a task division algorithm based on a performance model. It calculates the proportion of work to assign to each computing unit so that the difference in execution time between the sub-tasks is as small as possible. To equalize execution time across computing units and avoid waiting, we also provide a scheduling algorithm that works on the task DAG and satisfies the dependency constraints within the available computing resources. Tasks are ordered in time and space to reduce the total execution time.

3. To improve the reliability of the whole system, we propose a double in-memory checkpointing protocol for CPU-MIC clusters that tolerates failures in both offload and reverse-offload scenarios. We add capabilities to offload applications that support the double in-memory checkpoint/restart scheme, and we implement regular checkpoint saving and loading. We also propose two optimizations for further improvement: asynchronous, concurrent writing of checkpoint files and a peer replacement strategy for failed nodes.

4. To validate the programming model and framework, we apply them to hybrid LU factorization on the Tianhe-2 heterogeneous system. The evaluation shows that our work alleviates pressure on PCIe transfers and improves performance: the rOffload implementation of Linpack outperforms the offload implementation by about 7% on a single node and by about 12% on 64 nodes.
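
The idea of dividing work in proportion to a performance model, so that all computing units finish at about the same time, can be illustrated with a minimal sketch. This is not the thesis's algorithm, whose performance model is not given in the abstract. It assumes the simplest possible model, a constant throughput per device, and the names `split_work`, `finish_times`, and the example rates are hypothetical.

```python
def split_work(total_items, rates):
    """Split total_items among devices in proportion to their throughput.

    rates maps a device name to an assumed constant throughput
    (items per second). This constant-rate model is an illustrative
    assumption, not the performance model used in the thesis.
    """
    if total_items < 0 or not rates or any(r <= 0 for r in rates.values()):
        raise ValueError("need a non-negative item count and positive rates")

    total_rate = sum(rates.values())
    # Ideal share is total_items * rate / total_rate; round down first.
    shares = {name: int(total_items * r / total_rate) for name, r in rates.items()}

    # Hand the few items lost to rounding to the fastest devices.
    leftover = max(0, total_items - sum(shares.values()))
    for name in sorted(rates, key=rates.get, reverse=True)[:leftover]:
        shares[name] += 1
    return shares


def finish_times(shares, rates):
    """Predicted execution time of each device under the constant-rate model."""
    return {name: shares[name] / rates[name] for name in shares}


if __name__ == "__main__":
    # Hypothetical rates: the MIC processes items three times as fast as the CPU.
    rates = {"cpu": 1.0, "mic": 3.0}
    shares = split_work(1000, rates)
    print(shares)
    print(finish_times(shares, rates))
```

With the hypothetical rates above, the faster device receives three quarters of the items and both devices are predicted to finish together, which is the balance condition that contribution 2 aims for. A realistic model would also have to account for data transfer cost and for throughput that varies with problem size.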
Keywords/Search Tags:Heterogeneous system, MIC, Programming model, Performance optimization, Fault tolerance