
General-purpose Multi-core Cluster Parallel Tuning Strategy Research

Posted on: 2012-02-14
Degree: Doctor
Type: Dissertation
Country: China
Candidate: P Wang
Full Text: PDF
GTID: 1118330371965415
Subject: Computer architecture
Abstract/Summary:
Clusters built from multi-socket multicore servers and high-speed cLANs such as InfiniBand have become the mainstream high-performance computing platform. The rise of such platforms has a significant impact on performance tuning: tuning an application on them is very difficult, and because of the software-hardware gap most of the work is still done by hand. The main problem facing programmers is how to tune a given application for a target multicore cluster. This dissertation addresses that problem by developing a parallel performance tuning scheme that can efficiently synthesize all kinds of possible optimizations.
The target applications of the scheme are FMM and stencil computing. The scheme first divides the factors that affect application performance into four classes: computation, memory access, communication, and load balance (referred to as P, M, C, and B, respectively). Because different types of applications are characterized by different PMCB patterns, the scheme develops separate tuning procedures for FMM and for stencil computing. In FMM applications, all the important characteristics that affect performance can be determined precisely by hand. For such applications the scheme uses a tuning algorithm based on static analysis to synthesize all optimizations, and the experimental results are satisfactory. In stencil computing, the computation pattern is essentially independent of the memory access and communication patterns, so the computation-related optimizations can be determined independently first. The scheme then determines as many of the optimizations belonging to M as possible. After this step there remain some M and C optimizations that interact with each other and are hard to determine by static analysis. For these optimizations, we develop a suite of microbenchmarks that simulate the memory access and communication patterns occurring in the target applications. We use these microbenchmarks to compare all possible optimization combinations and choose the one with the best performance. The scheme works well: it delivers 95% of the performance of auto-tuning while spending only 10% of the auto-tuning time, because the microbenchmarks effectively capture the interference between optimizations and avoid simulating the whole application.
For FMM and some of the target stencil programs, load balance simply means distributing computation, memory access, and communication equally among all computing nodes. For Line-Sweep computing, which is also a form of stencil computing, a load-balancing scheme that guarantees every processor the same amount of computation, memory access, and communication cannot always achieve the best overall performance, because under some conditions it introduces too much memory access and communication. To overcome this problem, we propose a new data partitioning scheme for the Line-Sweep algorithm, named Balance-Partition. It effectively balances the cost of computation, memory access, and communication by reducing unessential constraints. The algorithm rests on three key techniques. First, a performance model is designed in which performance is determined solely by the way the data is divided. Second, based on this model, we reduce the search space of Balance-Partitions and find the best way to divide the data. Finally, we design a processor assignment function to generate a Balance-Partition. Experimental results show that when the best Balance-Partition and the best Multi-Partition divide the data in the same way, their performance is almost the same; when they divide the data differently, Balance-Partition significantly outperforms Multi-Partition.
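The microbenchmark step described in the abstract (score every combination of interacting optimizations on a small kernel, then keep the cheapest) can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the option names (`numa_aware`, `overlap_comm`, `block`), the helper names, and the `toy_cost` function are all hypothetical, and `toy_cost` is a made-up stand-in for a real microbenchmark so that the example stays deterministic.

```python
import itertools
import time


def time_once(kernel, config, repeats=3):
    """Best wall-clock time of kernel(config) over a few runs.

    Intended as the cost function when a real microbenchmark is available.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        kernel(config)
        best = min(best, time.perf_counter() - start)
    return best


def best_combination(options, cost):
    """Exhaustively score every combination of option values.

    options: dict mapping an option name to the values to try.
    cost:    callable taking a config dict and returning a number
             (lower is better), e.g. a microbenchmark timing.
    Returns (best_config, best_cost).
    """
    names = list(options)
    best_cfg, best_cost = None, float("inf")
    for values in itertools.product(*(options[n] for n in names)):
        cfg = dict(zip(names, values))
        c = cost(cfg)
        if c < best_cost:
            best_cfg, best_cost = cfg, c
    return best_cfg, best_cost


# Hypothetical search space: two memory/communication switches and a block size.
OPTIONS = {
    "numa_aware": [False, True],
    "overlap_comm": [False, True],
    "block": [16, 32, 64],
}


def toy_cost(cfg):
    """Made-up cost model standing in for a microbenchmark run.

    Each switch helps on its own, but enabling both adds a penalty,
    which mimics two optimizations interfering with each other.
    """
    cost = 10.0
    if cfg["numa_aware"]:
        cost -= 2.0
    if cfg["overlap_comm"]:
        cost -= 1.5
    if cfg["numa_aware"] and cfg["overlap_comm"]:
        cost += 2.5
    cost += abs(cfg["block"] - 32) / 16.0
    return cost


if __name__ == "__main__":
    print(best_combination(OPTIONS, toy_cost))
```

In this toy model the two switches each reduce the cost individually but are worse together, so the search selects only one of them. That kind of interaction is what makes tuning each optimization in isolation insufficient. With a real kernel, `cost` could be something like `lambda cfg: time_once(kernel, cfg)`, where `kernel` is a small routine reproducing the application's memory access and communication pattern.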
Keywords/Search Tags: auto-tuning, static analysis, microbenchmark, data partition, load balance