
Research On The Design Techniques Of Cluster-on-Chip Architecture

Posted on: 2011-12-12
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L B Huang
Full Text: PDF
GTID: 1118330332487007
Subject: Computer Science and Technology
Abstract/Summary:
Multicore architecture has become the necessary approach to improving processor performance in step with Moore's Law. However, as the number of cores increases and chip heterogeneity grows, the rapid gain in raw performance has not been matched by flexibility in multicore resource management or simplicity in application programming. On the contrary, the complexity of multicore structures makes it difficult to use the large number of on-chip resources efficiently. To resolve this contradiction between growing computing power and a relatively backward multicore computing model and management scheme, this dissertation introduces HPC cluster technology into multicore architecture design, resulting in the cluster-on-chip (CoC) computing paradigm, which provides good support for efficient multicore organization and utilization. Our research on CoC starts from the underlying hardware and works across different implementation levels, including CoC hardware structures and the parallel programming model. The work explores key technologies that allow multicore performance to keep pace with Moore's Law. The main contributions are as follows:

1. We put forward a low-cost, high-performance floating-point SIMD accelerator architecture for the CoC computing node. Low-cost designs based on hardware sharing are mature for fixed-point SIMD accelerators, but floating-point SIMD accelerators still rely on simple replication. We present the first floating-point SIMD accelerator architecture based on a shared-hardware design: the original double-precision datapath can be segmented to support two single-precision operations in parallel. In addition, to address the increasing demand for 128-bit floating-point computation, we propose a low-cost 128-bit SIMD hardware design. The experimental results show that the proposed SIMD accelerator structure effectively reduces hardware cost and power consumption.

2. We propose an efficient data-parallel accelerator architecture that overcomes the performance bottlenecks of such accelerators. Data-parallel architectures face many obstacles, such as non-aligned access, data permutation, and control flow, which prevent them from reaching their theoretical performance. After quantifying the characteristics of data permutation operations in vectorized applications, we propose the implicit data permutation (IDP) mechanism together with its hardware structure and compiler strategy. IDP significantly reduces the number of explicit data permutation operations and effectively removes the permutation bottleneck of the data-parallel accelerator. We also propose a vectorized loop buffering mechanism, which eliminates the overhead of vector loop control and address calculation so that loops execute efficiently on the data-parallel accelerator. Based on these techniques, we introduce an efficient, parallel, high-performance multimedia accelerator called MCP.

3. We present an efficient on-chip network architecture for CoC. Classic network-on-chip designs are optimized only for long unicast communication and usually suffer from power and latency drawbacks. This is awkward for important applications such as cache coherence protocols and SIMD computations, which require extensive multicast or broadcast communication. We present a hierarchical virtual bus interconnection structure. Based on the existing datapath of the network links, the virtual bus is reconstructed dynamically upon request, and it provides low-latency unicast and multicast/broadcast communication services. We also propose a hardware scheme that supports a hybrid shared-memory/message-passing programming model, and we design the corresponding memory hierarchy and coherence protocol, which are seamlessly compatible with existing MPI and OpenMP programs.

4. We design an efficient hybrid parallel programming model for CoC that exploits various levels of parallelism. In addition, considering that a SIMD accelerator can achieve higher performance at lower hardware cost and power consumption than a multicore architecture, we introduce the loop-based streamization programming model (LSM) for the data-parallel accelerator in the CoC computing node. Similar to incremental OpenMP programming, LSM reduces the complexity of traditional stream programming and greatly lightens the burden on the programmer. A corresponding hardware extension of the general-purpose processor (GPP) is also proposed. The experiments show that LSM uses the data-parallel accelerator efficiently and achieves a substantial performance improvement.

CoC architecture design is a new topic. Existing research remains at the conceptual level and does not address concrete design. In this dissertation, we study CoC design from three aspects: data-parallel core design for the CoC node, the CoC on-chip network architecture, and the CoC parallel programming model. The implementation, verification, and evaluation results show that these techniques are effective and can be used in the design and implementation of future multicore microprocessors.
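The abstract's first contribution describes segmenting a double-precision datapath so that it carries two single-precision operations in parallel. The following is only a software-level sketch of that lane behavior, not the dissertation's hardware design: it treats a 64-bit word as two packed 32-bit floats and adds them lane by lane. The function names (`pack2`, `unpack2`, `simd_add_2xf32`) and the lane order (low lane in bits 0 to 31) are assumptions made for illustration.

```python
import struct
from typing import Tuple


def pack2(lo: float, hi: float) -> int:
    """Pack two single-precision values into one 64-bit word.

    Assumed layout: `lo` occupies bits 0-31, `hi` occupies bits 32-63.
    """
    return int.from_bytes(struct.pack("<ff", lo, hi), "little")


def unpack2(word: int) -> Tuple[float, float]:
    """Split a 64-bit word back into its two single-precision lanes."""
    return struct.unpack("<ff", word.to_bytes(8, "little"))


def simd_add_2xf32(a: int, b: int) -> int:
    """Add two packed words lane by lane.

    Each lane is added independently, so no carry or rounding
    information crosses the 32-bit boundary. Overflow handling is
    omitted in this sketch.
    """
    a_lo, a_hi = unpack2(a)
    b_lo, b_hi = unpack2(b)
    # struct.pack("f") rounds each sum back to single precision.
    return pack2(a_lo + b_lo, a_hi + b_hi)


if __name__ == "__main__":
    x = pack2(1.5, 0.5)
    y = pack2(2.25, 0.25)
    print(unpack2(simd_add_2xf32(x, y)))
```

The point of the sketch is that both lanes travel through one 64-bit operand, which is the property that lets a shared double-precision datapath serve two single-precision operations instead of replicating a separate unit per lane.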
Keywords/Search Tags:Cluster-on-Chip, Multicore, SIMD, Implicit data permutation, Virtual bus on chip network, Hybrid programming model