Design Of 128 Bit SIMD Arithmetic Unit Based On Subword Parallel Technology

Posted on:2017-09-05

Degree:Master

Type:Thesis

Country:China

Candidate:J K Shan

Full Text:PDF

GTID:2348330488972989

Subject:Engineering

Abstract/Summary:

With the widespread use of multimedia processing, digital signal processing and 3D graphics, vector computing units that support SIMD operation play an increasingly prominent role in modern processor design. A vector computing unit usually occupies a large area and lies on the critical path of computation, so its design directly affects the performance of the whole CPU. This thesis presents a high-performance implementation, based on subword parallel technology, of the 128-bit SIMD complex vector arithmetic instructions of the IBM POWER processor architecture.

The 128-bit SIMD arithmetic unit is compatible with 27 vector instructions of the POWER instruction set, covering three classes: Vector Integer Multiply, Vector Integer Multiply-Add/Sum and Vector Integer Sum-Across. It supports fixed-point saturating operation, and each operation completes in 6 clock cycles. The design consists of three major components: the subword parallel multiply-accumulators, the selection components and an accumulator. The subword parallel multiply-accumulator is the core of the design.

According to the functions of the instruction set, the SIMD arithmetic unit implements four 32-bit subword parallel multiply-accumulators. Each multiply-accumulator supports one 32×32-bit, two 16×16-bit or four 8×8-bit signed/unsigned operations, supports mixed-sign (signed×unsigned) operations in 8-bit mode, and supports saturation detection in 16-bit mode. The key components of the subword parallel multiply-accumulator are designed in detail, and several implementation methods are given. Two implementation methods are given for the partial product generation component.
The mixed subword parallel method does not need to consider carry-chain propagation, and it simplifies the compressor and the adder, so the circuit logic is simple. The Booth selection method halves the number of partial products, which greatly reduces the delay of the compression circuit. For the compression component, the 3-2 and 4-2 compressors are each modified to suit the three operating modes and are arranged in an improved Wallace tree structure; with only a small amount of added control logic, the compressors support all modes at the same time without extra delay. The adder is an LF (Ladner-Fischer) parallel prefix adder, chosen for its overall performance, which implements the subword parallel function through a carry truncation mechanism. For the saturation detection part, a saturation detection method based on addition and subtraction is given and optimized; the overflow pre-detection technique for multiply-add operations is then analyzed, and a method suited to this design is derived in combination with the instruction set. With the subword parallel technology of this thesis, a high-performance multiplier/multiply-accumulator of any bit width can be implemented at small cost.

The design is a 6-stage pipeline and is verified on a UVM platform. Under the DC synthesis tool environment with the SMIC 0.18 μm technology library, the 128-bit SIMD arithmetic unit has an area of 590015 μm² and a maximum frequency of 350 MHz. Compared with a common multiplier, the subword parallel multiplier can perform a variety of complex vector arithmetic operations. The DC results show that, compared with the common multiplier, the delay increases by only 9.1% and the area by only 5.9%.
Compared with the traditional multiplier/multiply-accumulator, the high-performance parallel technique proposed in this thesis has clear technical advantages and can meet the SIMD vector computation requirements of high-performance CPUs.
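
The operating modes described in the abstract can be illustrated with a small software reference model. The following Python sketch is not the circuit from the thesis: it only models, for a single 32-bit multiply-accumulator slice, what the 8-, 16- and 32-bit modes compute, how an addition behaves when carries are cut at subword boundaries, and what saturation to a given width means. All function names (`split_lanes`, `subword_multiply`, `subword_add`, `saturate`) and the lane ordering (least significant lane first) are assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF


def split_lanes(word, width):
    """Split a 32-bit word into 32 // width lanes, least significant lane first."""
    lane_mask = (1 << width) - 1
    return [(word >> shift) & lane_mask for shift in range(0, 32, width)]


def to_signed(value, width):
    """Reinterpret an unsigned width-bit value as two's complement."""
    return value - (1 << width) if value & (1 << (width - 1)) else value


def subword_multiply(a, b, width, a_signed=True, b_signed=True):
    """Return the full-precision product of each lane pair.

    width=32 yields one 32x32 product, width=16 two 16x16 products,
    width=8 four 8x8 products. Mixed-sign operation (signed x unsigned)
    is selected with the two flags.
    """
    if width not in (8, 16, 32):
        raise ValueError("width must be 8, 16 or 32")
    products = []
    for x, y in zip(split_lanes(a & MASK32, width), split_lanes(b & MASK32, width)):
        x = to_signed(x, width) if a_signed else x
        y = to_signed(y, width) if b_signed else y
        products.append(x * y)
    return products


def subword_add(a, b, width):
    """Lane-wise wrapping add of two 32-bit words with carries cut at lane boundaries."""
    msb = 0
    for shift in range(width - 1, 32, width):
        msb |= 1 << shift                      # top bit of every lane
    # Add everything below each lane's top bit; the sum cannot leave its lane.
    low = (a & ~msb & MASK32) + (b & ~msb & MASK32)
    # Fold the top bits back in with XOR, so no carry crosses a boundary.
    return (low ^ ((a ^ b) & msb)) & MASK32


def saturate(value, width, signed=True):
    """Clamp an integer to the range representable in width bits."""
    if signed:
        lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    else:
        lo, hi = 0, (1 << width) - 1
    return max(lo, min(hi, value))
```

For example, under these assumptions `subword_multiply(0x00020003, 0x00040005, 16)` gives `[15, 8]`, and `subword_add(0x80FF80FF, 0x80018001, 8)` gives `0` because every 8-bit lane wraps independently, whereas the same operands with `width=32` give `0x01010100`.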

Keywords/Search Tags:

SIMD, Subword parallel, Multiply/Multiply-add, Booth algorithm, Addition

Related items

1	Research And Design Of Floating-Point Accelerator
2	The Design And Implementation Of High-performance 64 Bit Fixed-point SIMD Multiply Accumulate For FT-XDSP
3	The Design Of Floating-Point Multiply-Add Fused Units In General Purpose Processors
4	Design And Implementation Of The Low-Power DSP Multiply-Add-Fused Unit
5	Evaluation of new multiply and multiply-accumulate structures in FPGAs
6	The Design And Implementation Of Multiple-precision Floating-point Multiply-Add Fused Unit
7	The Design, Optimization And Verification Of Fixed-point Multiply Accumulate For X-DSP
8	The Design And Verification Of Multiply Unit Of 600MHz YHFT-DX
9	The Design And Implement Of Floating-point Fused-multiply-add Unit For High-performance Microprocessor
10	Research And Optimization On Low Power Floating Point Multiply ADD Fused Unit