
Memory Optimization On Chip Multi-core Processors

Posted on: 2012-09-21
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L Deng
Full Text: PDF
GTID: 1118330341451751
Subject: Computer Science and Technology
Abstract/Summary:
With the increasing transistor density of Very Large Scale Integration (VLSI) chips, the single-chip multi-core processor (CMP) has become the mainstream single-chip architecture, offering better scalability, lower design complexity, and better performance per watt than single-core processors. CMPs come in two kinds: homogeneous and heterogeneous. Both boost performance but bring several challenges, including deeper memory hierarchies, shared-cache contention, and limited memory bandwidth. To address these problems, this dissertation investigates three kinds of memory optimization technology: algorithm-level optimization for applications with irregular memory access, cache-structure optimization for irregular-access applications and online transaction processing (OLTP) workloads, and memory-access scheduling for improving memory bandwidth utilization.

For algorithm-level optimization, we begin by analyzing the memory-access characteristics of dense and sparse irregular matrix computations, and we set up a prioritized storage model for reusable data. We then derive a unified formula for average DMA bandwidth from a series of DMA experiments. Guided by this performance formula, we apply six memory optimization techniques to irregular matrix computation on heterogeneous processors: temporary-matrix elimination, blocked parallel matrix computation, overlapping computation with data transfer through multi-buffering, hiding local-store access latency with loop unrolling, reducing control-instruction overhead by changing the matrix storage format, and parallel computation on the PPE.
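The multi-buffering technique mentioned above can be sketched as a double-buffering loop. This is an illustrative sketch only, not the dissertation's actual Cell BE SPE code: `dma_get` and `compute_block` are hypothetical stand-ins for an asynchronous `mfc_get`-style DMA transfer and the per-block kernel.

```python
# Illustrative double-buffering sketch: fetch the next block while
# computing on the current one. In real Cell BE SPE code the fetch
# would be an asynchronous DMA (mfc_get); here a plain copy stands in.

def dma_get(src, start, size):
    """Stand-in for an async DMA transfer: fetch one block into a local buffer."""
    return src[start:start + size]

def compute_block(block):
    """Stand-in for the per-block kernel (here: sum of elements)."""
    return sum(block)

def process_double_buffered(data, block_size):
    total = 0
    buffers = [None, None]  # two local-store buffers, used alternately
    n_blocks = (len(data) + block_size - 1) // block_size
    buffers[0] = dma_get(data, 0, block_size)  # prefetch the first block
    for i in range(n_blocks):
        nxt = (i + 1) % 2
        if i + 1 < n_blocks:
            # start the "DMA" for the next block before computing
            buffers[nxt] = dma_get(data, (i + 1) * block_size, block_size)
        total += compute_block(buffers[i % 2])  # compute on the current block
    return total
```

With asynchronous DMA, the fetch of block `i+1` and the computation on block `i` proceed in parallel, which is exactly the overlap the abstract describes.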
Taking the Cell BE processor as our test bed, we obtain speedups of 13.51x and 21.75x over a single PPE for the SWIM and CG benchmarks, respectively.

For cache-structure optimization, we optimize memory access in two ways: cache partitioning and a new cache structure. Based on an analysis of memory-access behavior, we propose a cache partitioning method oriented to memory-access characteristics. Evaluated on 12 sparse matrices from the University of Florida sparse matrix collection, it eliminates all capacity and conflict cache misses and reduces the cache capacity required for sparse matrix-vector multiplication. We also propose a software-controllable semi-transparent cache for OLTP. By analyzing the database management system and the memory-access characteristics of OLTP applications, we build a data-classification model that divides data into three types: discard, protect, and free competition. Results from a full-system architecture simulator show a cache-miss reduction of up to 35% compared with a conventional transparent cache.

For memory-controller optimization, we focus on the memory-request scheduling algorithm. Taking into account the interference among memory requests from different cores and the electrical characteristics of DRAM, we propose a two-stage memory-request scheduling algorithm that considers both fairness among cores and memory bandwidth utilization. In the first stage, cores are assigned different priorities so that important threads meet their real-time requirements; threads of equal priority are scheduled fairly using a multi-core memory-request waiting-time estimation model. In the second stage, targeting the properties of state-of-the-art DDR3 DRAM, we build a memory-request constraint model and propose a starvation-free, multi-channel, least-waiting-time scheduling algorithm.
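A minimal sketch of the first stage of the two-stage idea described above, under a simplified request model (the fields and the tie-breaking rule are illustrative, not the dissertation's exact algorithm): requests are ordered by core priority, and among equal priorities by accumulated waiting time to avoid starvation. The second stage, which applies DDR3 bank and channel timing constraints, is abstracted away here.

```python
from dataclasses import dataclass

@dataclass
class Request:
    core: int
    priority: int  # lower value = more important (illustrative convention)
    wait: int      # cycles this request has already waited

def schedule_stage_one(requests):
    """Stage one of a two-stage scheduler (sketch): order by core priority,
    breaking ties by longest waiting time so no equal-priority request
    starves. Stage two (DDR3 timing constraints) is not modeled."""
    return sorted(requests, key=lambda r: (r.priority, -r.wait))

reqs = [Request(core=0, priority=1, wait=5),
        Request(core=1, priority=0, wait=2),
        Request(core=2, priority=1, wait=9)]
order = schedule_stage_one(reqs)
# The highest-priority core is served first; among equal priorities,
# the longest-waiting request goes ahead.
```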
Finally, for 10 benchmarks from SPEC CPU2000 running on a cycle-accurate simulator, the algorithm reduces waiting time by 33% and achieves a 1.49x speedup over FCFS scheduling. Meanwhile, the fairness policy keeps the unfairness of memory-access slowdown among cores below 1.1.
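The unfairness bound of 1.1 can be read as the ratio between the largest and smallest per-core slowdown. A minimal sketch of that metric, with made-up illustrative slowdown figures:

```python
def unfairness(slowdowns):
    """Ratio of the worst to the best per-core memory slowdown;
    1.0 means perfectly fair."""
    return max(slowdowns) / min(slowdowns)

# Hypothetical per-core slowdowns (shared run time / solo run time):
cores = [1.20, 1.25, 1.30, 1.22]
assert unfairness(cores) <= 1.1  # within the fairness bound reported above
```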
Keywords/Search Tags: Multi-core processor, irregular matrix computing, OLTP, memory optimization, cache partitioning, shared cache structure, memory request scheduling