Design And Optimization Of Core Architecture In The Many-Core Processor

Posted on:2009-11-12

Degree:Master

Type:Thesis

Country:China

Candidate:Y P Liu

Full Text:PDF

GTID:2178360242481313

Subject:Circuits and Systems

Abstract/Summary:

Multi-core architecture is the development trend for high-performance microprocessors, and multi-core and many-core design has become an important issue for microarchitects. This thesis presents the design of the core in a many-core architecture. The many-core processor consists of 24 tile nodes and a synchronization node, connected by a crossbar and an on-chip mesh network, with static routing between nodes. Each tile node consists of 4 cores, an ICACHE and a DCACHE, and a LOCK MANAGER that implements synchronization management in hardware. The core uses an 8-stage, in-order-issue pipeline, is compatible with the MIPS R2000 instruction set, and supports a subset of the Coprocessor 1 instructions. The instruction cache and data cache are each 256B. The core has 7 functional units: a fixed-point ALU, a multiply unit, a divide unit, a floating-point ALU, a floating-point multiply unit, a floating-point divide unit, and a memory unit. The pipeline is subject to data hazards, control hazards, and structural hazards, and the design handles each with a different method. For structural hazards, the instruction and data caches are separate, and only one functional unit executes at any one time. The functional units are themselves pipelined, which guarantees that instructions using the same functional unit can issue back to back. Modern branch prediction structures rely on large tables to store branch histories and to cache branch target addresses or instructions for incremental increases in performance, and thus tend to degrade performance per transistor. The core therefore resolves control hazards together with the compiler: it implements static prediction for conditional and unconditional branches using an always-taken model, and the compiler fills delay slots to keep the pipeline flowing. The core initially resolved data hazards by stalling the pipeline, but this hurts performance: the IPC of fixed-point programs is 0.3. The core therefore adopts forwarding logic to resolve RAW hazards.
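The forwarding idea mentioned above can be pictured with a minimal C sketch. It is not the thesis's FORWARD module: the two-latch layout (`ex_mem`, `mem_wb`), the `StageResult` type, and the function name are assumptions made for illustration, and the real core has an 8-stage pipeline with more forwarding sources.

```c
#include <stdint.h>

/* Hypothetical pipeline latch: what an older, still in-flight instruction
 * is about to write back. Names and layout are illustrative only. */
typedef struct {
    int      writes_reg;  /* nonzero if the instruction writes a register */
    unsigned rd;          /* destination register number */
    uint32_t value;       /* result already computed in this stage */
} StageResult;

/* Select the value of source register rs for the instruction entering
 * execute. The youngest in-flight producer wins; otherwise the value read
 * from the register file is used. Register 0 is hardwired to zero in MIPS,
 * so it is never forwarded. */
uint32_t forward_operand(unsigned rs, uint32_t regfile_value,
                         const StageResult *ex_mem,
                         const StageResult *mem_wb)
{
    if (rs != 0 && ex_mem->writes_reg && ex_mem->rd == rs)
        return ex_mem->value;   /* result produced one instruction earlier */
    if (rs != 0 && mem_wb->writes_reg && mem_wb->rd == rs)
        return mem_wb->value;   /* result produced two instructions earlier */
    return regfile_value;       /* no RAW hazard on this operand */
}
```

Without such a bypass, the consumer would have to wait in decode until the producer reaches write-back, which is the stall-only behaviour the abstract associates with an IPC of 0.3.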
Dynamic scheduling, a common technique in modern high-performance microprocessors, was abandoned because of its excessive area. An evaluation of the area of the reservation stations and reorder buffer shows that they would enlarge the core by the equivalent of two fixed-point ALUs and a floating-point ALU. Moreover, the share of the total core area occupied by arithmetic units would drop by 4.4%, which conflicts with the design goal that arithmetic units occupy close to 80% of the core area. The design therefore does not use dynamic scheduling to resolve data hazards and instead introduces forwarding logic to raise core performance. The FORWARD module was synthesized with Design Compiler and occupies only a small area, comparable to one decode logic block. In the Verilog description, the latency of the FORWARD logic was the main problem; it was reduced through coding style until the design requirement was met. Forwarding raises the IPC of fixed-point programs to about 0.5, an improvement of 0.2. Modern memory is byte-addressed, while ordinary memory-access instructions require addresses aligned to word or byte boundaries. To extend the core so that it can perform unaligned loads and stores, for example when reading data from a file or sharing data with another CPU, the LWL/LWR and SWL/SWR instructions were added to the core architecture. The RTL description of these four instructions was derived from an analysis of their semantics. When the cores of a many-core processor share resources, contention may occur. A code section that operates on shared data, that is, a critical section, must execute atomically. When threads use the same resources, the handling of variables inside the critical section is therefore a very important issue.
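The semantics of the unaligned-load pair described above can be illustrated with a small C model. This is a sketch under stated assumptions, not the thesis's RTL: it assumes big-endian byte order and a flat byte array as memory, and every function name is hypothetical. SWL/SWR would mirror the same masks on the store side.

```c
#include <stdint.h>

/* Read the aligned 32-bit word containing addr, big-endian (assumption). */
static uint32_t load_aligned_be(const uint8_t *mem, uint32_t addr)
{
    uint32_t a = addr & ~3u;
    return ((uint32_t)mem[a] << 24) | ((uint32_t)mem[a + 1] << 16) |
           ((uint32_t)mem[a + 2] << 8) | (uint32_t)mem[a + 3];
}

/* LWL: merge the bytes from addr to the end of its aligned word into the
 * most-significant bytes of rt, keeping the remaining low bytes of rt. */
uint32_t lwl(const uint8_t *mem, uint32_t addr, uint32_t rt)
{
    unsigned shift = 8u * (addr & 3u);
    uint32_t keep = shift ? (0xFFFFFFFFu >> (32u - shift)) : 0u;
    return (load_aligned_be(mem, addr) << shift) | (rt & keep);
}

/* LWR: merge the bytes from the start of the aligned word up to addr into
 * the least-significant bytes of rt, keeping the remaining high bytes. */
uint32_t lwr(const uint8_t *mem, uint32_t addr, uint32_t rt)
{
    unsigned shift = 8u * (3u - (addr & 3u));
    uint32_t keep = shift ? (0xFFFFFFFFu << (32u - shift)) : 0u;
    return (load_aligned_be(mem, addr) >> shift) | (rt & keep);
}

/* An unaligned word load is the usual two-instruction sequence:
 * LWL at the first byte, LWR at the last byte. */
uint32_t load_unaligned(const uint8_t *mem, uint32_t addr)
{
    uint32_t rt = 0;
    rt = lwl(mem, addr, rt);
    rt = lwr(mem, addr + 3u, rt);
    return rt;
}
```

Each instruction touches only one aligned word, so the pair never issues a misaligned memory access even when the requested word straddles a word boundary.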
To prevent several cores from using the resources of a critical section concurrently, a synchronization mechanism is required. The hardware provides synchronization primitives, and synchronization between cores is implemented with the two instructions LL/SC. LL/SC are the instructions in the MIPS instruction set that support atomic operations; in this processor design their semantics are changed so that they correspond to requesting a lock and releasing it, respectively. The core implements the two instructions in the MEM module according to their assigned opcodes. In the LSU, the shared variable and the core ID are sent as a message over the network to the LOCK MANAGER, which manages lock requests. An unlock message travels through the network to the two DCACHEs on the tile, and the DCACHE passes it on to the LOCK MANAGER, which guarantees that shared data is used correctly.
After the core architecture has been described in Verilog, verifying the functional correctness of the code is a very important issue. This thesis uses a simulator-comparison method in which a C-language simulator plays a vital role. The C simulator is an event-driven simulation framework for a MIPS-instruction-set many-core processor; it is easy to operate, flexible to configure, fast to run, requires no changes to the compiler, and is highly extensible. The C simulator itself was tested with a large number of single-threaded and multi-threaded programs; for the single-core, single-thread case, on the order of a thousand test programs from GCC_TOURTURE were used to ensure functional correctness. When the C simulator and the RTL simulation are compared, variable values are exchanged between the C code and Verilog through the PLI mechanism. A verification platform is built between the two computation nodes, and the correctness of each executed instruction is checked by comparing three items: the instruction opcode, the result value, and the destination register.
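The three-way comparison described above can be sketched in C as follows. The record layout and the function name are assumptions for illustration; the abstract does not say how the values are packaged once they cross the PLI boundary.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-instruction record reported by either model. */
typedef struct {
    uint32_t opcode;    /* instruction word or opcode field */
    uint32_t result;    /* value produced by the instruction */
    uint8_t  dest_reg;  /* destination register number */
} RetireRecord;

/* Compare the reference trace (C simulator) with the trace from the RTL
 * simulation. Returns the index of the first instruction on which any of
 * the three fields differ, or -1 if all n records agree. */
long first_mismatch(const RetireRecord *ref, const RetireRecord *rtl, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (ref[i].opcode   != rtl[i].opcode ||
            ref[i].result   != rtl[i].result ||
            ref[i].dest_reg != rtl[i].dest_reg)
            return (long)i;
    }
    return -1;
}
```

Reporting the first diverging instruction, rather than only a pass or fail flag, points directly at the instruction where the RTL and the reference model stopped agreeing.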

Keywords/Search Tags:

Optimization

Related items

1	Improvement And Application Of Brainstorming Optimization Algorithm In Several Categories Optimization Problems
2	Studies On Optimization Methods With Extremal Dynamics And Applications
3	Improved Bare-bones Particle Swarm Optimization Algorithm And Its Application
4	The Route Optimization Of WSN Based On The Combination Of Ant Colony Optimization And Particle Swarm Optimization
5	Hybrid Intelligent Algorithm And Its Application In Optimization Problems
6	Research On Evolutionary Optimization And Learning For Complex Continuous Optimization Problems
7	Research On Optimization Algorithms Based On Generalized Response Surface Of Complex Black-box Model
8	Study On Optimization Theoretical Analysis And Methods Of Design For Testability In SoC
9	Research On Optimization Based On Swarm Intelligence And Its Application
10	Emperor Penguin Optimizer For Some Kinds Of Optimization Problems