Research On Microarchitecture Optimization Methodology For Data Processing Units

Posted on:2017-02-14

Degree:Master

Type:Thesis

Country:China

Candidate:Y Zhou

Full Text:PDF

GTID:2308330488990981

Subject:Information and Communication Engineering

Abstract/Summary:

As computing has advanced, applications have grown more complex and exhibit diverse runtime characteristics. Data processing units are critical components of computer systems, and their performance depends on the applications running on them. A conventional fixed configuration cannot benefit all applications. This thesis explores adaptive methodologies for processor microarchitecture design across three aspects of data processing units (the multiply-accumulate unit, the prefetch engine, and data processing unit modeling), taking the characteristics of the applications into account.

First, the multiply-accumulate (MAC) unit is an important component of microprocessors and digital signal processors. Conventional high-speed MACs are usually split into multiple pipeline stages to increase throughput. However, pipelining a MAC comes at the cost of increased latency and overhead in power and area. We observe that in multiply-intensive APE lossless audio decoder applications, more than 70% of the multiply instructions have at least one operand of 16-bit width (the 32×16 pattern), and that these 32×16-pattern multiply instructions are scattered rather than clustered. Based on this observation, we propose a data-aware MAC mechanism that dynamically profiles multiply-intensive applications and tunes its pipeline stages according to the data width of the multiply operands, reducing the stalls caused by multiply instructions. Experiments on a full-system simulation based on a field-programmable gate array (FPGA) show that the proposed data-aware MAC unit improves the performance of APE applications by 11% and the energy-delay product by 15%.

Second, the effectiveness of a data prefetching engine depends heavily on the application, and no single prefetching configuration benefits all applications. An improper hardware data prefetching configuration can cause cache pollution and degrade system performance. We propose an adaptive prefetching engine that uses machine learning algorithms to predict the optimal prefetch configuration for different applications at runtime. The adaptive prefetching framework, based on a decision tree, learns from the memory access characteristics of the applications and then classifies the prefetching configuration. The proposed prefetching engine dynamically tunes its prefetching parameters according to the classifier's output. We train the decision tree with the SPEC CPU 2006, EEMBC, and Olden benchmarks. The machine-learning-based prefetcher improves overall performance by 14% on average and the energy-delay product by 24% over a baseline system with no prefetching. Our approach also outperforms competing prefetchers (CDP, GHB, and Stream).

Finally, as workloads evolve, future data centers will place higher demands on the throughput, efficiency, and self-management of data processing units. Based on the characteristics of future large-scale workloads, we implement a software model of the data processing unit (DPU) in a many-core system. The DPU is a two-issue, 4-way simultaneous multithreading, in-order processor based on the ARMv8 instruction set, a design that reflects the tradeoff between power and performance. We implement the performance model on top of the QEMU simulator, including the pipeline, cache, branch predictor, and in-order dispatcher. The simulator can run a full system and can be extended to multicore or many-core systems. We have validated the simulator and evaluated its performance with the SPEC CPU 2006 benchmark suite.

To recap, this thesis explores a self-tuning methodology that combines processor microarchitecture with machine learning algorithms, applied to the multiply-accumulate unit and the data prefetching engine from the perspective of application characteristics. In addition, a cycle-accurate in-house simulator of the data processing unit in many-core systems has been implemented.
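
The data-aware MAC idea summarized above (profile the operand widths of multiply instructions, then pick a pipeline depth) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names, the 0.5 threshold, and the 1-stage and 2-stage depths are hypothetical values chosen only for illustration, and operands are assumed to be signed integers.

```python
from typing import Iterable, Tuple


def fits_in_16_bits(value: int) -> bool:
    """Return True if a signed integer is representable in 16 bits."""
    return -(1 << 15) <= value < (1 << 15)


def is_32x16_pattern(a: int, b: int) -> bool:
    """A multiply matches the 32x16 pattern if at least one operand is 16-bit."""
    return fits_in_16_bits(a) or fits_in_16_bits(b)


def fraction_32x16(operand_pairs: Iterable[Tuple[int, int]]) -> float:
    """Fraction of profiled multiplies that match the 32x16 pattern."""
    pairs = list(operand_pairs)
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if is_32x16_pattern(a, b))
    return matches / len(pairs)


def choose_pipeline_stages(
    operand_pairs: Iterable[Tuple[int, int]],
    threshold: float = 0.5,  # hypothetical tuning threshold
    short_stages: int = 1,   # hypothetical depth for narrow operands
    long_stages: int = 2,    # hypothetical depth for full 32x32 multiplies
) -> int:
    """Pick a pipeline depth from the profiled operand widths."""
    if fraction_32x16(operand_pairs) >= threshold:
        return short_stages
    return long_stages


# A small profiling window: three of the four multiplies have a 16-bit operand.
window = [(100000, 12), (70000, 90000), (-5, 1 << 20), (3, 4)]
profiled_fraction = fraction_32x16(window)
selected_stages = choose_pipeline_stages(window)
```

In this sketch the decision is made once per profiling window; a hardware mechanism would have to make the equivalent choice with simple width-detection logic rather than software.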

Keywords/Search Tags:

Data processing unit, multiply-accumulate unit, data prefetcher, machine learning, self-tuning

Related items

1	Low-Power Design And Verification Of Vector Process Unit
2	The Design And Implementation Of High-performance 64-Bit Fixed-point SIMD Multiply Accumulate For FT-XDSP
3	Research And Design Of Multiplier-Accumulator Unit Based On RISC-V Instruction Set Microprocessor
4	The Design And Verification Of Multiply Unit Of 600MHz YHFT-DX
5	Evaluation of new multiply and multiply-accumulate structures in FPGAs
6	Application Of Gated Recurrent Unit In Equipment Maintenance Under Industrial Big Data
7	Research On Key Technologies Of VLSI Implementation Of Adaptive Filtering Algorithm
8	Key Topics Researching For Single GPU And GPU Heterogeneous Cluster
9	Research And Optimization On Low Power Floating Point Multiply ADD Fused Unit
10	Design, Optimization And Verification Of The Floating-point MAC Unit For The 32 Bit High Performance M-DSP