Font Size: a A A

Compilation-Based Memory Optimizations For Scientific Computing Applications On Stream Processors

Posted on:2009-01-06Degree:DoctorType:Dissertation
Country:ChinaCandidate:Y ZhangFull Text:PDF
GTID:1118360278456604Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
The stream processor is a good choice for customized accelerators of high productivity computing systems, due to its high ALU density, low power and flexible programmability, and has been successfully applied to No.1 current IBM Roadrunner system in TOP500 list published in 2008. However, when processing scientific applications, the stream compiler, built for media applications originally, shows bad applicability, which makes the memory access decide the performance of stream processors. Therefore, compilation-based memory optimizations for scientific applications on stream processors have become the key to improving the performance of stream processors and releasing the"Memory Wall"problem.This thesis studies compilation-based memory optimizations for scientific applications on stream processors which has three-level memory hierarchies: local register files (LRF), stream register file (SRF) and off-chip DRAM. Exploiting the locality of on-chip memory, hiding the long memory access latency and avoiding the SRF overflow can improve the memory performance of stream processors. Hence, enhancing the LRF and SRF locality, hiding the long memory access latency and avoiding the SRF overflow are lucubrated in this thesis. The main contributions of this thesis are as follows.1. The stream compiler exploits the reuse of records distributed on different ALU clusters by introducing inter-cluster communications, which degrades the kernel performance. The thesis presents the Stream Transpose (ST) approach to exploit such reuse. The approach, by reorganizing the data, puts data that have been distributed on neighboring ALU clusters on the same ALU cluster, hence exploiting the reuse without any inter-Cluster communications. Furthermore, in order to avoid the memory bank conflict brought by the reorganization, the ST approach forcasts when the conflict happens and avoids it by loop-splitting. The experimental results show the approach can exploit the reuse of records distributed among ALU clusters without any inter-cluster communications or any performance degradation of memory access, and gains at most 1.46 speedup over the approach with inter-cluster communications.2. Exploiting the reuse of streams in the SRF can enhance the SRF locality. However, to our best knowledge, there is still no automatic approach to exploit the reuse supplied by stream references with variable start or end bounds. This thesis presents the Constant-Bound Stream Replacement (CBSR) approach, the first approach exploiting the whole reuse among variable-bound streams automatically. Some novel theories, algorithms and mechanisms are proposed in CBSR, such as the theory for recognizing variable-bound stream reuse, the first defined and built Stream Reuse Graph (SRG), based-on SRG Stream-Level Program Transformation (SLTP) algorithm and the SRF pressure evaluation and release algorithm. Experimental results show the approach exploits the reuse of variable-bound streams effectively, enhances the SRF locality, and achieves 1.14~2.78 speedup over the stream scheduling approach of the stream compiler.3. The partial reuse happens, for two stream references accessing the shared portion of values, when the second reference can reuse the part generated by the first reference. The problem of exploiting partial stream reuse does not have its parallel in the traditional data reuse exploitation (for scalars and arrays). This thesis presents the Extended CBSR (E-CBSR) approach, the first approach to exploit such reuse, including defining and quantifying partial reuse, recognizing when it happens, improving the SRG for describing partial reuse and extending the SLTP algorithm for exploiting partial reuse. E-CBSR approach can exploit both whole and partial reuse, eliminate the redundant load of streams and improve the SRF locality. Experimental results show E-CBSR gains 1.27~2.54 speedup.4. Prefetching streams can hide the long memory access latency. However, any resource allocation conflict leads to prefetch failure. This thesis presents a SRG-based SRF Allocation Conflict Avoidance (SRFACA) algorithm to solve this problem. Compared with existing algorithms, the SRFACA algorithm avoids the prefetching failure with lower SRF overheads, hence improving the success of prefetching. Experimental results show the SRFACA algorithm gains at most 1.88 speedup, based on the results of the E-CBSR approach.5. SRF overflow may destroy the benefits of exploiting stream reuse and prefetching streams. The current stream compiler employs strip-mining to address this problem, which demolishes exploiting the cross-iteration reuse and stream prefetching nevertheless. This thesis presents a SRG-based loop-tiling algorithm. Compared with strip-mining, our algorithm can compute accurate SRF requirement of programs, automatically select tile size, and avoid the SRF overflow, with the stream reuse exploited and streams prefetched. Experimental results show our algorithm gains 1.12~2.61 speedup over strip-mining.
Keywords/Search Tags:Stream, Stream Programming Model, Stream Processor, Record Reuse, Whole Reuse, Partial Reuse, Prefetch, Loop-Tiling
PDF Full Text Request
Related items