
Studies On The PIM Architectures And Techniques For Scientific Applications

Posted on: 2008-04-24
Degree: Doctor
Type: Dissertation
Country: China
Candidate: P Wen
Full Text: PDF
GTID: 1118360242999353
Subject: Computer Science and Technology

Abstract/Summary:
Current high-performance computer systems usually adopt decoupled architectures in which the processor and memory are separated, connected by a hierarchy of caches and complex interconnection systems. In these processor-centric designs, the processor and memory use high-volume semiconductor processes, each highly optimized for the particular characteristics its product demands, e.g., high switching rates for logic processes and long retention times for DRAM processes. With the progress of semiconductor manufacturing and the rapid development of processor architectures, processor speed has far outpaced memory speed; the processor-memory gap grows ever larger, leading to the well-known "memory wall". Processor-centric designs invest a great deal of power and chip area just to bridge this widening gap between processor and main-memory speed.

PIM (Processor-In-Memory), which merges processor and memory into a single chip, reunites the two and offers the well-known benefits of high-bandwidth, low-latency communication between processor and memory, together with reduced energy consumption. As a result, many systems based on PIM architectures have been proposed. Research has mainly focused on PIM micro-architecture, PIM parallel systems, PIM programming models, and PIM compiler optimization; the common goal across this work is to exploit the high bandwidth and low latency of the PIM architecture to the fullest.

Our research on PIM architecture techniques stresses two aspects. The first is the PIM micro-architecture: finding a processor architecture suited to PIM that makes the best use of the benefits PIM supplies. The second concerns the PIM parallel system.
After studying scientific-computation-oriented PIM architectures, we propose a Vector-based Processor-In-Memory (V-PIM) architecture that couples the characteristics of vector processing with those of the PIM architecture, present a parallel system based on V-PIM, and discuss software optimization techniques. The primary research and innovative work in this dissertation can be summarized as follows:

1. Proposal and design of V-PIM, the Vector-based Processor-In-Memory architecture. Vector architectures have a mature programming model and a powerful ability to express data parallelism explicitly, while PIM architectures provide a high-performance memory system, so it is natural to unite the two. After comparing register-register and memory-memory vector architectures by the utility metric performance/area, the results show that combining a memory-memory vector architecture with PIM is superior to combining a register-register one with PIM: it has lower power and better on-chip resource utilization. We therefore adopt a memory-memory vector architecture in the V-PIM design. This dissertation describes the design of the V-PIM architecture, presents its extended vector instruction set, and verifies the architecture on an FPGA-based platform.

2. The V-Parcels communication mechanism for the V-PPIM parallel system. The communication subsystem is central to the computing efficiency, scalability, and suitability of a parallel system. To reduce communication traffic and improve the performance of vector execution, we propose the V-Parcels communication mechanism for the V-PPIM parallel system. Its main characteristic is support for transferring vector operations between V-PIM nodes. Based on an analysis of how vector elements are distributed, it can dynamically generate a V-Parcels communication package that transfers either data or operations, so as to localize the computation, minimize communication, and maximize computing performance.

3. A compile-time thread distinguishment algorithm for the V-PIM-based architecture. On a V-PIM-based architecture, a thread with low temporal locality that runs on the V-PIM processor is called a Light-Weight Thread (LWT), while a thread with a low cache miss rate that runs on the host processor is called a Heavy-Weight Thread (HWT). How threads are distinguished directly affects system performance, so a suitable distinguishment algorithm is needed. Based on a thread's predicted execution performance on the V-PIM and on the host, we present a compile-time method for distinguishing LWTs from HWTs. Once the compiler identifies a thread's type, it can schedule the thread onto the proper processor and thereby accelerate the system. The algorithm is simple and easily implemented, and its results approach the real situation.

4. The COPE architecture, a composite organization of PIM and multiple computing clusters for push execution. We present COPE (Composite Organization for Push Execution), a new PIM architecture that combines PIM memory and multiple execution clusters on a chip to overcome the power, wire-latency, and memory-wall challenges facing future teraflops chips. In the memory-centric COPE architecture, the PIMs act as smart memories and the multiple execution clusters act as processing units; data is pushed to the execution clusters, which perform the computation. The clusters are interconnected by an on-chip operation network. As smart memory, the PIM holds both code and data and steers instruction execution in the clusters. The execution clusters follow a data-driven execution model: temporary computing results can be passed directly to the next processing unit through register communication, without being written back to registers again. This avoids massive hardware mechanisms that improve neither performance nor scalability.
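To make the V-Parcels idea concrete, the following is a minimal sketch, not the dissertation's implementation: for a vector operation whose operands are spread across V-PIM nodes, a parcel can carry either data (pulling remote elements to the local node) or the operation itself (pushing the computation to where the data lives). All names, sizes, and the byte-count heuristic below are illustrative assumptions.

```python
# Hypothetical sketch of the V-Parcels data-vs-operation decision:
# ship whichever direction moves fewer bytes between V-PIM nodes.
from dataclasses import dataclass

ELEM_SIZE = 8        # assumed bytes per vector element (double precision)
OP_PARCEL_SIZE = 64  # assumed fixed size of an operation-carrying parcel

@dataclass
class VParcel:
    kind: str            # "data" (pull operands here) or "operation" (push work there)
    dest_node: int
    payload_bytes: int

def make_parcel(elems_here: int, elems_remote: int, remote_node: int) -> VParcel:
    """Choose between pulling remote elements to this node or pushing the
    operation (plus this node's elements) to the remote node."""
    pull_cost = elems_remote * ELEM_SIZE                 # move remote data here
    push_cost = OP_PARCEL_SIZE + elems_here * ELEM_SIZE  # move op + local data there
    if push_cost < pull_cost:
        return VParcel("operation", remote_node, push_cost)
    return VParcel("data", remote_node, pull_cost)

# Most elements live on the remote node: cheaper to push the operation to them.
p = make_parcel(elems_here=16, elems_remote=4096, remote_node=3)
print(p.kind)  # -> operation
```

The point of the sketch is only the decision rule: localizing computation at the node holding most of the vector minimizes traffic, which is the stated goal of V-Parcels.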
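The compile-time LWT/HWT distinguishment described above can likewise be sketched with a toy cost model. This is a hypothetical illustration, not the dissertation's algorithm: the latency constants, CPI values, and `ThreadProfile` fields are all assumed stand-ins for the compiler's actual performance estimates.

```python
# Hypothetical sketch: classify a thread as LWT (run on the V-PIM processor)
# or HWT (run on the host) by comparing estimated cycle counts on each side.
from dataclasses import dataclass

@dataclass
class ThreadProfile:
    instructions: int       # estimated dynamic instruction count
    mem_accesses: int       # estimated memory accesses
    cache_miss_rate: float  # predicted miss rate in the host's cache

# Assumed latency parameters (cycles).
HOST_CPI = 1.0
HOST_HIT_LATENCY = 3
HOST_MISS_LATENCY = 200    # off-chip DRAM access from the host
PIM_CPI = 2.0              # simpler core in a DRAM process runs slower
PIM_MEM_LATENCY = 20       # on-chip DRAM access inside the PIM

def host_cycles(t: ThreadProfile) -> float:
    mem = t.mem_accesses * ((1 - t.cache_miss_rate) * HOST_HIT_LATENCY
                            + t.cache_miss_rate * HOST_MISS_LATENCY)
    return t.instructions * HOST_CPI + mem

def pim_cycles(t: ThreadProfile) -> float:
    # No deep cache hierarchy on the PIM side: every access pays the
    # (low) on-chip DRAM latency, and the core has a higher CPI.
    return t.instructions * PIM_CPI + t.mem_accesses * PIM_MEM_LATENCY

def classify(t: ThreadProfile) -> str:
    """Return 'LWT' if the thread is predicted to run faster on V-PIM."""
    return "LWT" if pim_cycles(t) < host_cycles(t) else "HWT"

# A streaming thread with poor locality favors the PIM side ...
print(classify(ThreadProfile(10_000, 5_000, 0.60)))  # -> LWT
# ... while a cache-friendly thread stays on the host.
print(classify(ThreadProfile(10_000, 5_000, 0.02)))  # -> HWT
```

The essential shape matches the text: the decision is driven entirely by predicted execution performance on each processor, so it can run wholly at compile time once the profile estimates are available.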
Keywords/Search Tags: vector processing, processor in memory, memory wall, parallel processing, billion-transistor architectures, petaflops