Font Size: a A A

Research On Compilation Techniques On The Stream Architecture

Posted on:2009-09-10Degree:DoctorType:Dissertation
Country:ChinaCandidate:J DuFull Text:PDF
GTID:1118360278456579Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
The stream architecture is a novel architecture that can address the problem of processor-memory gap. It uses abundant arithmetic units, an effective multi-level memory hierarchy, and multiple parallel techniques synthetically. It has shown tremendous performance advantages in the domains of media and signal processing. Scientific computing applications require very high arithmetic rates, and they often exhibit some characteristics of stream applications. Thus how to use the stream architecture to accelerate scientific computing programs is worth studying further. Existing stream compilers, which are media applications oriented, don't have corresponding optimization aiming at the characteristics of scientific computing programs, so that scientific computing programs can't achieve high performance. In order to make scientific computing programs exert performance advantages of stream architecture, we develop the SF95 stream programming language and the SF95Compiler stream compiler for scientific computing programs. This paper researches on compiler optimization techniques for scientific computing programs on stream architecture. Especially, three important components of the SF95Compiler framework are researched and implemented, including stream transformation, code optimization and scientific computing function library. Taking the Imagine stream architecture, which is a representative stream architecture, as the platform, we implement and verify some optimizing algorithms in the SF95Compiler stream compiler. Primary innovative works of this paper can be summarized as follows:(i) An optimizing streamization technique based on the D&C matrix is proposed. Streamization is a source to source transformation process that can transform SF95 programs to intermediate codes of stream programs. The simplest streamization is the direct streamization approach, which doesn't apply any program transformation optimization for the stream architecture. Therefore, the stream programs we obtained are inefficient. To address the problem, this paper studies the loop transformation and data transformation techniques for the stream architecture, and proposes an optimizing streamization technique based on the D&C matrix. The contributions of the research lie in the following three aspects. Firstly, a Data&Computation matrix (D&C matrix) is built to abstract the relationship between loops and arrays of the original programs. Secondly, three key techniques for optimizing streamization approach are proposed based on the transformation of the D&C matrix, i.e. coarse-grained program transformation, fine-grained program transformation and stream organization optimization. Finally, an optimizing streamization algorithm based on the D&C matrix is proposed. The experimental results show that the optimizing streamization technique based on the D&C matrix can achieve high instruction level parallelism and data level parallelism and enhance locality of LRF and SRF. Compared with the direct streamization technique, the optimizing streamization technique achieves an average speedup of 2.76.(ii) An optimal strip-mining technique based on a parameter model is proposed. Strip-mining technique is one of the most significant approaches for improving Stream Register File (SRF) bandwidth utilization on the stream processor. Achieving optimal strip size is the base of strip-mining technique. The optimal strip size is the length of the strip size that makes the program execution time minimal. Thus, it is critical to quantify the program execution time influenced by the strip size for achieving optimal strip size. In order to achieve the theoretical optimal strip size by modeling analysis, this paper proposes an optimal strip-mining technique based on parameter model, to minimize the execution time. The contributions of the research lie in the following three aspects. Firstly, according to the property and optimization approach of the given program, this paper builds a parameter model of program execution time based on the memory system with a single address generator. Secondly, based on the model analysis, this paper explores the optimal strip size selection approaches to the computation intensive programs and memory intensive programs respectively. Finally, an optimal strip-mining algorithm for any programs is proposed in terms of the above conclusions. The experimental results show that the parameter model based strip-mining technique can effectively hide and avoid the memory access latency, and improve the utilization efficiency of SRF.(iii) The optimization techniques for enlarging the overlap between computation and memory access within a kernel are proposed. Exploiting the overlap between computation and memory access within a kernel is an effective method for hiding the memory access latency and keeping the functional units saturated during stream processing. Since the computation time and memory access time of a single kernel can not be overlapped with each other, the single kernel is often partitioned to several small sub-kernels by partitioning the streams within the kernel, so that the computation time and memory access time of different sub-kernels can be overlapped. In order to maximize the overlap between computation and memory access within a single kernel, a combination of analytical and empirical approaches is used in this paper to propose three strip-mining strategies, e.g. empirical strategy, long stream strategy and optimal strategy. The three strategies are suitable for different scientific computing programs, and their compiler complexities increase in turn. Users can choose the appropriate strip-mining strategy, considering the specialties of the programs, and the different compilers' demands of complexity and accuracy of compiler algorithms, so as to insure the simpleness and near-optimality of the compiler algorithms. The experimental results show that the empirical strategy and long stream strategy are suitable for small-scale and large-scale scientific computing programs respectively, and the optimal strategy can be used to hide latency of any scientific computing programs.(iv) A reuse optimization technique between kernels for any scientific computing programs is proposed. Enhancing the stream reuse between kernels is an effective approach to improve SRF reuse and then avoid memory access latency. Existing stream compilers can not transform two kinds of stream reuse (long stream reuse and partial reuse) between kernels to SRF reuse, thus there is potential for reuse optimization further. To address this problem, we propose a reuse optimization technique between kernels for any scientific computing programs. The contributions of the research lie in the following three aspects. Firstly, this paper researches on the reuse of long streams between kernels restricted by the capacity of SRF, and proposes a reuse optimization algorithm of long stream between kernels. Secondly, an extended concept of stream reuse between kernels, namely partial reuse between kernels, is defined. Based on the new concept, this paper explores some kinds of partial reuse that can't be recognized by the stream compiler between kernels, and proposes a partial reuse optimization algorithm to recognize and capture all the partial reuse between kernels. Finally, combining the above two algorithms, this paper proposes a reuse optimization algorithm that isn't restricted by the SRF capacity and the kind of partial reuse between kernels. The experimental results show that the proposed reuse optimization algorithm between kernels can improve the locality of SRF dramatically, reduce the amount of memory accesses efficiently, and enhance the sustained supply of required streams.(v) The optimization and implementation of some typical functions in the scientific computing library are proposed. Statistical results show that some common kernel functions are widely used in most of the scientific computing programs. Therefore, implementation and reuse of these common kernel functions are crucial to develop the stream programs on the stream processor. Based on these kernel functions, we build a software inline scientific computing function library. The optimization and implementation of these functions determine the performance of the library. This paper takes five typical programs (Jacobi, GEMM, Transp, Laplace, Swim) in the scientific computing function library as examples, and demonstrates the optimization and implementation strategies of these programs on the stream processor according to the specialty of each program. The experimental results show that after optimization, all of the five scientific computing programs can achieve obvious performance improvement on the stream processor.
Keywords/Search Tags:stream architecture, scientific computing, SF95Compiler, D&C matrix, strip-mining, overlap between computation and memory access, SRF reuse, scientific computing function library
PDF Full Text Request
Related items