
A Checkpoint/Restart Scheme for CUDA Applications with Complex Memory Hierarchy

Posted on: 2014-12-24
Degree: M.S.
Type: Thesis
University: Arkansas State University
Candidate: Zhang, Yulu
Full Text: PDF
GTID: 2458390008453247
Subject: Computer Science
Abstract/Summary:
Checkpoint/restart has been an effective mechanism for achieving fault tolerance in many scientific applications, and implementations have been explored at different levels. However, as GPUs gain an expanding role in high-performance computing, a more effective checkpoint/restart scheme is needed; none exists yet because of the GPU's batch-mode execution. The GPU's complex memory hierarchy also means that computation states are scattered across different memory locations that are difficult to fetch, and parallel execution makes the state of each thread difficult to construct.

This thesis proposes an application-level checkpoint/restart scheme to save and restore GPU computation states. A precompiler and a run-time support module have been developed to construct and save states in CPU system memory dynamically. Memory blocks are registered, and new data structures are proposed to save and restore the computation states represented by variables and pointers in the GPU. Secondary storage can be utilized for scalability and long-term fault tolerance. CUDA applications with complicated memory use are supported as well. Experimental results have demonstrated the effectiveness of the proposed scheme.
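To make the idea of registered memory blocks and restorable pointers concrete, the following is a minimal host-side sketch in C++. It is not the thesis's implementation: the names `Block`, `SavedPtr`, `BlockRegistry`, and `demo_roundtrip` are hypothetical, and ordinary host memory stands in for GPU memory. It assumes one common approach, in which each pointer is saved as a (block index, byte offset) pair so that it stays valid when a block is reallocated at a different address after restart. In a real CUDA program the copies would presumably go through `cudaMemcpy` between device and host memory.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical registry entry describing one registered memory block.
struct Block {
    void*       base;  // current base address of the block
    std::size_t size;  // block size in bytes
};

// A pointer stored in address-independent form.
struct SavedPtr {
    std::size_t block;   // index of the containing block
    std::size_t offset;  // byte offset from the block base
};

class BlockRegistry {
public:
    // Register a block and return its index.
    std::size_t add(void* base, std::size_t size) {
        blocks_.push_back({base, size});
        return blocks_.size() - 1;
    }

    // Convert a raw pointer to (block, offset).
    // Returns false if the pointer lies in no registered block.
    bool encode(const void* p, SavedPtr& out) const {
        auto addr = reinterpret_cast<std::uintptr_t>(p);
        for (std::size_t i = 0; i < blocks_.size(); ++i) {
            auto lo = reinterpret_cast<std::uintptr_t>(blocks_[i].base);
            if (addr >= lo && addr < lo + blocks_[i].size) {
                out = {i, static_cast<std::size_t>(addr - lo)};
                return true;
            }
        }
        return false;
    }

    // Convert (block, offset) back to a raw pointer using current bases.
    void* decode(const SavedPtr& s) const {
        return static_cast<char*>(blocks_[s.block].base) + s.offset;
    }

    // After a restart, a block may be reallocated at a new address.
    void rebind(std::size_t i, void* new_base) { blocks_[i].base = new_base; }

    // Copy the contents of every block, in registration order, into one image.
    std::vector<char> checkpoint() const {
        std::vector<char> image;
        for (const Block& b : blocks_) {
            const char* p = static_cast<const char*>(b.base);
            image.insert(image.end(), p, p + b.size);
        }
        return image;
    }

    // Copy an image back into the (possibly rebound) blocks.
    void restore(const std::vector<char>& image) const {
        std::size_t pos = 0;
        for (const Block& b : blocks_) {
            std::memcpy(b.base, image.data() + pos, b.size);
            pos += b.size;
        }
    }

private:
    std::vector<Block> blocks_;
};

// Checkpoint one block and a pointer into it, then restore both
// into a fresh allocation that simulates a restarted process.
bool demo_roundtrip() {
    std::vector<int> before = {1, 2, 3, 4};
    BlockRegistry reg;
    std::size_t id = reg.add(before.data(), before.size() * sizeof(int));

    SavedPtr cursor;
    if (!reg.encode(&before[2], cursor)) return false;
    std::vector<char> image = reg.checkpoint();

    std::vector<int> after(4, 0);  // new storage at a different address
    reg.rebind(id, after.data());
    reg.restore(image);

    int* restored = static_cast<int*>(reg.decode(cursor));
    return restored == &after[2] && *restored == 3;
}
```

Storing offsets rather than raw addresses is the design choice that lets the saved image be written to secondary storage and reloaded later, since nothing in it depends on where the blocks happened to be allocated.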
Keywords/Search Tags:Memory, Applications, Scheme