
Research On On-chip Memory Management And High Efficient Synchronization For Homogeneous Many-core Processors

Posted on: 2012-01-05
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X W Chen
GTID: 1118330362960206
Subject: Electronic Science and Technology
Abstract/Summary:
With the rapid development of integrated circuit technology and the strong push from application requirements, Systems-on-Chip (SoCs) are evolving from bus-based single-core/multi-core architectures to network-based many-core architectures. Because their processor cores and routers are identical, homogeneous many-core processors feature good regularity and scalability, which facilitates exploiting the parallel potential of multiple processor cores. Although homogeneous many-core processors have powerful parallel computing capacity, they bring new challenges in architectural design. One of these challenges is how to provide efficient memory management and synchronization mechanisms so as to achieve higher parallel performance. Memory management and synchronization have become two crucial research topics in designing many-core processors.

First, the architectural characteristics and parallel program behaviors of homogeneous many-core processors are analyzed, two performance models are established and discussed, and a homogeneous many-core processor experimental platform is constructed. Second, this paper addresses the memory and synchronization issues in two aspects: an on-chip programmable memory management technique and a highly efficient dual-channel hardware synchronization mechanism. This paper proposes "Distributed Shared Memory Oriented Data Management Engine", "Static and Dynamic Partitioning of Hybrid Distributed Shared Memory Space", "Fast Dual-channel Semaphore Synchronization with Dynamic Buffer Allocation", and "Fast Dual-channel Barrier Synchronization with Cooperative Communication". They are evaluated by analyzing hardware costs, building performance models, and applying both synthetic experiments and application benchmarks.

The main contributions of the paper are summarized below:

1) Two concepts (Equivalent Serial Packet and Equivalent Serial Communication) are defined so as to construct a quantitative network communication model. Further, two performance models of homogeneous many-core processors are built under uniform and hotspot traffic patterns. After analysis and discussion, some suggestions are given on how to handle the "parallelization - communication" dilemma in architectural design and program development.

2) In order to extend the application scope of homogeneous many-core processors, a Data Management Engine (DME) is designed and implemented, using a microcode approach, for on-chip Distributed Shared Memory (DSM) management. The DME allows users to implement various functions in microcode according to different applications. The DME contains two coprocessors that can concurrently serve requests from the local node and from remote nodes via the on-chip network. To use the DME, a command-triggered microcode execution mechanism is proposed, and a microcode library and a microprogramming flow are developed. Guided by the microprogramming flow, DSM functions are implemented in microcode. Experimental results show that, as the network size is scaled up, the delay overhead incurred by the DME is relatively low in comparison to the network communication overhead. It can be concluded that the proposed microcode solution has both a reasonable delay overhead, close to that of a dedicated hardware solution, and the flexibility of a software-only solution.

3) To reduce Virtual-to-Physical (V2P) address translation overhead, a Hybrid Distributed Shared Memory (HDSM) space is proposed and its static and dynamic partitioning techniques are explored. In the HDSM space, the local memory is partitioned into two parts, private and shared, using two addressing schemes, physical (real) and logical (virtual). The design philosophy is to support fast physical memory accesses for private data and globally virtual memory accesses for shared data. With static partitioning, the organization of the HDSM space is fixed at design time and never changes while the system is running. With dynamic partitioning, the private region and the shared region of the HDSM space can be configured at runtime. Experimental results show that the proposed HDSM has performance advantages over the conventional DSM. In our experiments, the maximal performance improvement is 37.89% and the minimal performance improvement is 3.68%.

4) To alleviate the serialization of semaphore synchronization, a fast dual-channel semaphore synchronization mechanism with dynamic buffer allocation is proposed to eliminate Head-of-Line (HoL) blocking and improve buffer utilization. Each node contains a dual-channel Semaphore Synchronizer with dynamic buffer allocation (SS). The SS offers a set of lock variables that are globally addressed and visible to all nodes. The SS can concurrently respond to synchronization requests from the local node and from remote nodes via the on-chip network. The physical buffers are dynamically allocated to logically form multiple virtual buffers, one per lock variable, so that requests waiting on one lock do not block requests for another. Experimental results show that, compared with the spin lock, the SS has better performance and provides homogeneous many-core processors with a highly efficient semaphore synchronization solution.

5) To alleviate the serialization of barrier synchronization, a fast dual-channel barrier synchronization mechanism with cooperative communication is proposed to optimize the network communication overhead. With it, barrier synchronization packets are broadcast in the on-chip network and merged into a single barrier synchronization packet if they target the same barrier, eliminating network contention among them. A cooperative communicator is designed in the router to provide hardware support for the cooperative communication of barrier synchronization packets. Routers collaborate with each other to accomplish a barrier synchronization task. Experimental results show that, with the help of cooperative communication, the all-to-all algorithm moves from the worst to the best solution and scales well.
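
As a rough illustration of the dynamic buffer allocation idea in contribution 4, the following Python sketch models a fixed pool of physical buffer slots that are linked on demand into one virtual FIFO queue per lock variable. It is a software model under stated assumptions, not the dissertation's hardware design: the class names `SharedBufferPool` and `SemaphoreSynchronizer`, the return values `"granted"`, `"queued"`, and `"retry"`, and the linked-list slot organization are all hypothetical, and the dual-channel (concurrent local and remote) behavior is not modeled.

```python
class SharedBufferPool:
    """Fixed pool of physical slots shared by per-lock virtual FIFO queues.

    Hypothetical model: slots are chained with next-pointers, so a queue
    for one lock can grow without reserving space for the other locks.
    """

    def __init__(self, num_slots):
        self.payload = [None] * num_slots    # waiting requester stored in each slot
        self.next_slot = [None] * num_slots  # link to the next slot of the same queue
        self.free = list(range(num_slots))   # physical slots not yet allocated
        self.head = {}                       # lock_id -> first slot of its virtual queue
        self.tail = {}                       # lock_id -> last slot of its virtual queue

    def enqueue(self, lock_id, requester):
        """Append a requester to the lock's virtual queue; False if the pool is full."""
        if not self.free:
            return False
        slot = self.free.pop()
        self.payload[slot] = requester
        self.next_slot[slot] = None
        if lock_id in self.tail:
            self.next_slot[self.tail[lock_id]] = slot
        else:
            self.head[lock_id] = slot
        self.tail[lock_id] = slot
        return True

    def dequeue(self, lock_id):
        """Remove and return the oldest waiter for this lock, or None if there is none."""
        slot = self.head.get(lock_id)
        if slot is None:
            return None
        requester = self.payload[slot]
        following = self.next_slot[slot]
        if following is None:
            del self.head[lock_id]
            del self.tail[lock_id]
        else:
            self.head[lock_id] = following
        self.payload[slot] = None
        self.free.append(slot)               # the slot returns to the shared pool
        return requester


class SemaphoreSynchronizer:
    """Hypothetical per-node lock table backed by the shared buffer pool."""

    def __init__(self, num_locks, num_slots):
        self.owner = [None] * num_locks
        self.waiting = SharedBufferPool(num_slots)

    def acquire(self, lock_id, node):
        if self.owner[lock_id] is None:
            self.owner[lock_id] = node
            return "granted"
        if self.waiting.enqueue(lock_id, node):
            return "queued"
        return "retry"                       # pool exhausted; requester must try again

    def release(self, lock_id, node):
        if self.owner[lock_id] != node:
            raise ValueError("release by a node that does not hold the lock")
        # Hand the lock to the oldest waiter of this lock only; waiters on
        # other locks live in separate virtual queues and are unaffected.
        self.owner[lock_id] = self.waiting.dequeue(lock_id)
        return self.owner[lock_id]
```

In this model, a request for a free lock is granted immediately even when other locks have waiters queued, which is the property that avoids Head-of-Line blocking, and any lock may use any free slot, which is what improves buffer utilization relative to fixed per-lock buffers.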
Keywords/Search Tags: Homogeneous Many-core Processors, Network-on-Chips, Distributed Shared Memory, Microcode, Semaphore Synchronization, Dynamic Buffer Allocation, Barrier Synchronization, Cooperative Communication