Fusion And Partition

Posted on:2007-03-14

Degree:Doctor

Type:Dissertation

Country:China

Candidate:L F Ceng

Full Text:PDF

GTID:1118360215970492

Subject:Computer Science and Technology

Abstract/Summary:

Parallelism is one of the effective techniques for improving computer performance. It can be classified into intra-processor parallelism and inter-processor parallelism. One trend in intra-processor parallelism is to support both instruction-level parallelism (ILP) and data-level parallelism (DLP); the Imagine stream processor developed at Stanford University is a representative ILP-DLP design. The distributed shared memory (DSM) multiprocessor has become a widespread inter-processor parallel architecture because of its scalability and straightforward programming model. The performance of both intra-processor and inter-processor parallelism is limited by the latency and bandwidth of memory access, which is commonly called the Memory Wall problem. For inter-processor parallelism, false sharing is another obstacle to high performance.

To address the Memory Wall problem, memory hierarchies are used extensively in current computer systems. For intra-processor parallelism, traditional processors use a latency-oriented cache hierarchy, and stream processors use a bandwidth-oriented stream hierarchy. The cache hierarchy and the stream hierarchy are together referred to as the memory hierarchy in a processor (MHP). For inter-processor parallelism, distributed memory among processors (DMP) is used. The effectiveness of a memory hierarchy is mainly determined by how well the memory-access-sequence is optimized. The memory-access-sequence of a program can be improved by instruction optimization and by data-layout optimization; data-layout optimization has recently been the more active of the two.

In this dissertation, several data-layout approaches are proposed to make full use of memory hierarchies and to improve the memory-access-sequence. For the bandwidth-oriented stream hierarchy, a linear kernel fusion approach is studied in addition to the data-layout approaches.
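As a rough illustration of why data layout matters when memory is shared at page granularity, the sketch below counts how many pages are written by more than one processor under two layouts of the same pair of arrays. It is not taken from the dissertation: the page size, element size, array length, processor count, block distribution, and all function names are assumptions chosen for the example.

```python
# Hypothetical parameters, chosen only for illustration.
PAGE_SIZE = 4096   # bytes per shared page
ELEM_SIZE = 8      # bytes per array element
N = 1024           # elements in each of the two arrays a and b
PROCS = 4          # processors, each owning a contiguous block of indices


def separate_addr(array_index, i):
    """Byte address of element i when array b is stored right after array a."""
    return (array_index * N + i) * ELEM_SIZE


def merged_addr(array_index, i):
    """Byte address of element i when a[i] and b[i] are interleaved."""
    return (2 * i + array_index) * ELEM_SIZE


def pages_by_writer(addr):
    """Map each page number to the set of processors that write to it."""
    writers = {}
    block = N // PROCS
    for p in range(PROCS):
        for i in range(p * block, (p + 1) * block):
            for array_index in (0, 1):  # processor p writes a[i] and b[i]
                page = addr(array_index, i) // PAGE_SIZE
                writers.setdefault(page, set()).add(p)
    return writers


def falsely_shared_pages(addr):
    """Count pages written by more than one processor.

    The processors write disjoint elements, so any such page is falsely shared.
    """
    return sum(1 for procs in pages_by_writer(addr).values() if len(procs) > 1)


separate_count = falsely_shared_pages(separate_addr)
merged_count = falsely_shared_pages(merged_addr)
```

With these assumed sizes, every page of the separate layout is written by two processors, while in the interleaved layout each processor's block fills exactly one page of its own. Different sizes or access patterns would give different counts.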
The primary innovative work in this dissertation can be summarized as follows:

1) DMP-oriented memory-access-sequence optimization

To solve the performance problems of our OpenMP/SDSM compiler, the following two approaches are proposed.

① The large sharing granularity in SDSM (usually a memory page) often causes false sharing in programs with small shared arrays and poor memory locality in programs with large-stride array accesses. Shared Array Fusion (SAF), which merges shared arrays that always appear together according to certain rules, is used to eliminate false sharing and to improve memory locality. The experimental results show that SAF is effective.

② SAF is not sufficient for SDSM with a small sharing granularity, e.g., a cache line. Array Fusion Based on Data Access Trace Alignment (BDATA-AF) is therefore proposed. It aligns the data access traces while merging arrays, which increases the utilization of small sharing units. For SDSM with a large sharing granularity, BDATA-AF aligns the distribution of shared arrays and thereby reduces the number of remote memory accesses. The experimental results show that BDATA-AF is more efficient than SAF.

2) MHP-oriented memory-access-sequence optimization

To make full use of the memory hierarchies in a processor, the following approaches are proposed.

① For the cache hierarchy in traditional processors, the optimal implementation of the parallel syntax in Fortran is studied, and Temporary Data Space Fusion (TDSF) is proposed to improve the performance of the object code. We implemented this approach in the open-source G95 compiler. Test results with G95 and the Intel EFC compiler show that TDSF clearly improves the performance of FORALLs that include array assignment statements. The performance of FORALLs can be further improved by combining TDSF with the loop sorting optimization.

② For the stream hierarchy in stream processors, Stream Partition (SP) and Linear Kernel Fusion (LKF) are proposed, based on an analysis of the architecture and programming model of a typical stream processor. Test results on the Imagine simulator show that (a) SP makes the best use of the bandwidth and enhances parallelism in the stream processor, and (b) LKF reduces the number of kernel calls and the time that streams spend moving between memory levels.
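The abstract does not spell out how LKF works. The sketch below shows one plausible reading, under the assumption that a "linear kernel" applies a fixed matrix to every record of a stream: two such kernels run back to back can then be replaced by a single kernel whose matrix is the product of the two, so the intermediate stream is never produced. The function names and sample data are hypothetical and are not taken from the dissertation or from the Imagine toolchain.

```python
def matmul(b, a):
    """Return the matrix product b @ a for matrices stored as lists of rows."""
    return [[sum(b[i][k] * a[k][j] for k in range(len(a)))
             for j in range(len(a[0]))]
            for i in range(len(b))]


def apply_kernel(matrix, stream):
    """Run a linear kernel over a stream: each record x becomes matrix @ x."""
    return [[sum(m * x for m, x in zip(row, record)) for row in matrix]
            for record in stream]


def fuse(second, first):
    """Fuse two linear kernels into one kernel with matrix second @ first."""
    return matmul(second, first)


# Hypothetical kernels and input stream (integer data keeps the check exact).
A = [[1, 2], [0, 1]]
B = [[2, 0], [1, 3]]
stream = [[1, 1], [2, 3], [4, 5]]

# Two kernel calls, with an intermediate stream between them.
two_passes = apply_kernel(B, apply_kernel(A, stream))

# One fused kernel call over the original stream.
one_pass = apply_kernel(fuse(B, A), stream)
```

Both paths produce the same output stream, but the fused version makes one kernel call instead of two and has no intermediate stream to store and reload, which is consistent with the benefits the abstract attributes to LKF.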

Keywords/Search Tags:

Compiler optimizations, memory-access-sequence optimizations, data-layout optimizations, memory hierarchies, distributed shared memory (DSM), false sharing, stream processor

Related items

1	Optimizations Of Memory-access For Stencil Computations On Shared-memory Multi-core Processor
2	Improving the performance of network file systems through memory optimizations
3	Shared memory optimizations for distributed memory programming models
4	Compiler Optimizations For Software-Managed On-Chip Memory
5	Language support and compiler optimizations for object-based software transactional memory
6	Research On Optimizations For In-memory Key-value Database With RDMA And NVM
7	Stream Processing Optimizations for Mobile Sensing Application
8	Design And Implementation Of Avoiding False Sharing Distributed Shared Memory Protocol
9	Speculative distributed shared-memory multiprocessors organized as processor-and-memory hierarchies
10	Compilation Techniques And Compiler Optimizations For Dataflow-Like Driven Tiled Processor Architecture