
Research On Architecture Of Multi-core Processor For High-Density Computing

Posted on: 2012-04-20
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H T Zhu
GTID: 1118330335462386
Subject: Computer system architecture
Abstract/Summary:
The demand for computing capability drives the development of both high performance computing technology and microprocessor technology. Microprocessors are now commonly used in supercomputers and thus play an important role in high performance computing. A large class of high performance computing applications is both compute intensive and memory access intensive, such as scientific computing, engineering computing, numerical simulation and signal processing. How to analyze and optimize the performance of such applications on multi-core processors is a key research issue.

Supported by the project on domestic high performance multi-core processors in China, and in the context of high-density computing, this thesis conducts in-depth research on performance analysis methods for general-purpose processors, architecture optimization, and performance optimization. The main investigations and innovations are as follows:

1. A performance analysis method for guiding architecture optimization
Existing methods, based on simulation fitting or high-level modeling, have difficulty describing the specific impact of individual architecture parameters on performance, so their value in guiding architecture optimization is limited. To reveal the relationship between the performance of matrix multiplication and the architecture parameters of a processor, this thesis builds single-core and multi-core performance models of matrix multiplication by analyzing its computation and memory access behavior together with the characteristics of the processor architecture. Based on the performance model, necessary conditions for optimal matrix multiplication performance are given as constraints on the architecture parameters. Two theorems on the optimization of architecture parameters are deduced, giving lower bounds on the number of registers and on memory bandwidth. The model can help find bottlenecks in a processor architecture and guide its optimization. Finally, the model has been verified on the Intel Core i7 processor and the Godson-3A processor: for 4-core matrix multiplication, its accuracy exceeds 90% and 86% respectively.

2. A fused madd-shuffle floating-point vector instruction
Based on the proposed performance model, this thesis analyzes the performance of programs running on processors with vector extensions. According to the analysis, applications written for such processors need a large number of data shuffle instructions, which greatly affect performance. To cope with this problem, this thesis proposes a new vector instruction that combines the vector madd instruction and the shuffle instruction. The new instruction completely eliminates separate shuffle instructions and reduces program length by more than 33%. Compared with a kernel program that uses the common vector shuffle instruction, the kernel program that uses the new vector instruction runs 33% or more faster and reduces power overhead.

3. A new decoupled access/execute architecture
To meet the requirements of high-density computing applications, this thesis proposes a new decoupled access/execute architecture that addresses the bottleneck caused by memory access. Building on the traditional decoupled access/execute architecture, a memory access coprocessor is added to the general-purpose processor (GPP). When a normal application is running, the existing memory access system is used. When a high-density computing application is running, the coprocessor is responsible for transferring data between registers and the L2 cache or memory, or for prefetching data to hide memory access time, which improves performance. The memory access coprocessor efficiently hides memory access latency and doubles memory access bandwidth compared with Godson-3A.

4. Mapping an efficient matrix multiplication onto the Godson-3B processor
To obtain high performance matrix multiplication on the Godson-3B processor, this thesis analyzes the memory access characteristics of each matrix in the multiplication and applies a different method to optimize the memory access behavior of each, in order to hide memory access time. The optimized matrix multiplication achieves 119.0 Gflops at an efficiency of 93.0%, which is more than 10 times better than on Godson-3A. The performance/power ratio is 2.98 Gflops/W, which is better than that of current mainstream processors.
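The Godson-3B figures quoted above can be cross-checked with simple arithmetic. The sketch below is not from the thesis. It assumes that "efficiency" means achieved performance divided by peak performance, and that the performance/power ratio was measured at the same 119.0 Gflops operating point. Under those assumptions it derives the implied peak performance and power draw. The function and variable names are illustrative only.

```python
# Figures reported in the abstract for matrix multiplication on Godson-3B.
ACHIEVED_GFLOPS = 119.0   # sustained performance
EFFICIENCY = 0.930        # assumed to mean achieved / peak
GFLOPS_PER_WATT = 2.98    # performance/power ratio


def implied_peak_gflops(achieved: float, efficiency: float) -> float:
    """Peak performance implied by an achieved rate and an efficiency."""
    return achieved / efficiency


def implied_power_watts(achieved: float, gflops_per_watt: float) -> float:
    """Power draw implied by an achieved rate and a performance/power ratio."""
    return achieved / gflops_per_watt


peak = implied_peak_gflops(ACHIEVED_GFLOPS, EFFICIENCY)
power = implied_power_watts(ACHIEVED_GFLOPS, GFLOPS_PER_WATT)

print(f"implied peak:  {peak:.1f} Gflops")   # prints "implied peak:  128.0 Gflops"
print(f"implied power: {power:.1f} W")       # prints "implied power: 39.9 W"
```

Both derived values are only as reliable as the two assumptions stated above. The abstract itself reports neither the peak performance nor the power draw.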
Keywords/Search Tags: High-Density Computing, Multi-Core, Performance Analysis Method, Architecture Optimization, Performance Optimization, Fused Instruction, Decoupled Access/Execute, Matrix Multiplication