
Multi-dimension and Multi-level Associated Cache Partitioning Mechanism in CMP

Posted on: 2014-01-10    Degree: Doctor    Type: Dissertation
Country: China    Candidate: S Li    Full Text: PDF
GTID: 1228330395996581    Subject: Computer system architecture
Abstract/Summary:
Restricted by power dissipation and the complexity of chip design, the traditional superscalar processor can no longer meet growing performance demands. At the same time, with the development of computer hardware technology, the chip multiprocessor (CMP) has become the mainstream direction. The CMP integrates multiple cores into a single chip, making full use of the growing transistor resources, and its further development makes thread-level parallelism possible [1]. However, a series of problems emerged. First, the performance gap between modern processors and memory is now a primary concern in computer architecture. Processors have large on-chip caches [2-6] and can access a block in just a few cycles, but a miss that goes all the way to memory incurs hundreds of cycles of delay; thus, reducing cache misses can significantly improve performance. Second, a microprocessor can execute multiple applications simultaneously, and these applications share the last-level cache; yet the traditional cache replacement policy, Least Recently Used (LRU), does not distinguish accesses from different applications. Cache pollution therefore occurs, leading to problems such as reduced system performance and unpredictable execution times. Regarding cache pollution: even as the number of concurrently executing threads keeps growing, the traditional LRU replacement policy still does not distinguish the cache space and data belonging to different threads.
So when a cache access of one thread misses and a block must be replaced, the evicted block may belong to another thread, interfering with that thread's cache accesses and replacements. The cache miss rate thus increases greatly; because of synchronization between threads, the execution time of some threads becomes unpredictable, and in more serious cases thread starvation may even occur. To solve these problems, this paper proposes two new hardware designs with accompanying algorithms; our experiments are based on a three-level cache hierarchy in which the first and second levels are private and the third level is shared.

The first method is based on an associated cache structure. The associated cache design in a CMP tries to make the ensemble of private L2 caches appear like a collective shared cache through Cache Shadow Directories, so that cooperative caching resources can fit the demands that multiple applications place on the cache. For example, associated caching will attempt to use remote L2 caches to hold data that would not fit in the local L2 cache, if spare space is available in a remote L2 cache. The Cache Shadow Directories are also integrated with the shared cache partitioning algorithm to improve system performance.

The second method decreases the miss rate by increasing the number of live blocks in the cache. A cache block alternates between two states: (a) the "live" state begins when a cache access misses and is followed by a series of hits to the block; (b) the "dead" state begins with the last access to the block and is followed by a series of misses. A cache block is therefore live if it will be referenced again before its eviction.
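The cooperative-caching idea can be sketched in a few lines. This is a minimal software model, not the dissertation's hardware design: the class names, the LRU modeling via an ordered map, and the "spill the victim to any remote L2 with spare space" policy are all illustrative assumptions; the shadow directory is reduced to a map from block address to the core holding a spilled copy.

```python
from collections import OrderedDict

class PrivateL2:
    """A private L2 modeled as an LRU set with a capacity in blocks."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # addr -> data, in LRU order

    def lookup(self, addr):
        if addr in self.blocks:
            self.blocks.move_to_end(addr)  # refresh LRU position on a hit
            return self.blocks[addr]
        return None

    def insert(self, addr, data):
        """Insert a block; return the evicted (addr, data) pair, if any."""
        evicted = None
        if len(self.blocks) >= self.capacity:
            evicted = self.blocks.popitem(last=False)  # evict the LRU block
        self.blocks[addr] = data
        return evicted

class AssociatedCaches:
    """Private L2s made to look collective: a shadow directory records
    which core's L2 holds each spilled block, so a local miss can be
    served by a remote L2 instead of going all the way to memory."""
    def __init__(self, n_cores, capacity):
        self.l2 = [PrivateL2(capacity) for _ in range(n_cores)]
        self.shadow = {}  # addr -> core currently holding a spilled copy

    def access(self, core, addr):
        data = self.l2[core].lookup(addr)
        if data is not None:
            return data, "local hit"
        owner = self.shadow.get(addr)
        if owner is not None and self.l2[owner].lookup(addr) is not None:
            return self.l2[owner].blocks[addr], "remote hit"
        # Miss: fetch from memory; spill the local victim, if any, to a
        # remote L2 with spare space and note it in the shadow directory.
        data = f"mem[{addr}]"
        victim = self.l2[core].insert(addr, data)
        if victim is not None:
            for other, l2 in enumerate(self.l2):
                if other != core and len(l2.blocks) < l2.capacity:
                    l2.insert(*victim)
                    self.shadow[victim[0]] = other
                    break
        return data, "miss"
```

For instance, with two cores of capacity 2, a block evicted from core 0's full L2 is retained in core 1's spare space, so a later access to it becomes a remote hit rather than a memory miss.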
From the last reference until the block is evicted, the block is dead. Cache efficiency can be improved if more live blocks are stored in the cache without increasing its capacity; improved cache efficiency reduces the miss rate and improves system performance.

To achieve better efficiency, dead blocks should be identified early: the earlier a dead block is identified, the more opportunity there is to improve cache efficiency. A block turns dead at its last access before eviction, so identification should happen between that last access and the eviction. Since the hardware cannot know with certainty which access to a block is the last, identifying a block as dead is a speculative action called dead-block prediction. Lai et al. were the first to propose the concept of dead-block prediction, together with a trace-based predictor. Three approaches to dead-block prediction have since been proposed: trace-based, counting-based, and time-based. All of these, however, target a single-core processor with a two-level cache. In view of the more complex structure of a chip multiprocessor with a three-level cache, this paper proposes a multicore Cache Block Predictor comprising a Multicore Dead Trace Predictor (MDTP) and a Multicore Tag Correlating Prefetcher. The MDTP tags every cache trace encoding with the corresponding CPU core, so a new memory reference can quickly find matching entries for a given core in the predictor's history table; if the table indicates a match, the MDTP predicts that the block is dead. The Multicore Tag Correlating Prefetcher goes further, replacing the dead block with a prefetched block, which increases system performance greatly.
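A trace-based predictor in the spirit of the MDTP can be sketched as follows. The details are illustrative assumptions, not the dissertation's hardware: the trace is a simple rolling hash of the PCs that accessed a block, the history table is an unbounded set of (core, trace) signatures rather than a fixed hardware table, and a prediction is a plain signature match.

```python
class MulticoreDeadBlockPredictor:
    """Trace-based dead-block prediction: the sequence of PCs accessing a
    block (its trace) is hashed and tagged with the accessing core's id.
    If a block previously died right after an identical (core, trace)
    signature, a block reaching the same signature is predicted dead."""
    def __init__(self):
        self.history = set()   # (core, trace) signatures that ended in death
        self.live_trace = {}   # addr -> (core, running trace hash)

    def _update_trace(self, old, pc):
        return ((old << 1) ^ pc) & 0xFFFF  # rolling hash of the PC path

    def on_access(self, core, addr, pc):
        """Record an access; return True if the block is predicted dead."""
        _, trace = self.live_trace.get(addr, (core, 0))
        trace = self._update_trace(trace, pc)
        self.live_trace[addr] = (core, trace)
        return (core, trace) in self.history  # history match => dead

    def on_evict(self, addr):
        """On eviction, learn the block's final trace as a death signature."""
        sig = self.live_trace.pop(addr, None)
        if sig is not None:
            self.history.add(sig)
```

Once one block has died after a given per-core PC path, any later block that follows the same path on the same core is flagged as dead at that point, freeing its frame early (or, in the prefetcher variant, letting a prefetched block take its place).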
For the conflict problem, cache partitioning is an effective way to resolve contention for the shared cache. Shared cache partitioning divides the last-level shared cache into independent cache sets, so that each application has its own set and cache pollution is avoided. At the same time, according to certain criteria, different amounts of cache resources can be allocated to meet each process's demand. The UF-ACP (Utility and Fair-based Associated Cache Partitioning) algorithm we present for the last-level cache is based on utility and fairness; it also takes process priority and thread count into account, improving throughput and fairness greatly.

This paper adopts the fairness metric M_{i,j} = |X_i - X_j|, where X_i = Missr_shr,i / Missr_ded,i; here Missr_ded,i denotes the miss rate of application i when it runs alone with a dedicated cache, and Missr_shr,i its miss rate when it shares the cache with other applications. Ideal fairness would make M_{i,j} = 0 for any two applications, which is impossible in an actual shared environment; the algorithm in this paper therefore keeps M_{i,j} below a threshold thrs, thus ensuring fairness. The miss-rate-based utility criterion, inspired by the literature, is the parameter U = (MisRate_p(a) - MisRate_p(b)) / (b - a): when the cache columns allocated to an application increase from a to b, the application's miss-rate change is MisRate_p(a) - MisRate_p(b). On this basis, this paper allocates cache space to the application with the maximum miss-rate change per unit of cache space.
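The two criteria amount to a few lines of arithmetic. The following sketch computes them directly from the definitions above; the function names and the sample miss rates are illustrative, not measurements from the dissertation.

```python
def fairness_metric(missr_shr, missr_ded, i, j):
    """M_{i,j} = |X_i - X_j|, with X_k the ratio of application k's
    shared-cache miss rate to its dedicated-cache miss rate.
    M_{i,j} == 0 means sharing degrades both applications equally."""
    x_i = missr_shr[i] / missr_ded[i]
    x_j = missr_shr[j] / missr_ded[j]
    return abs(x_i - x_j)

def utility(misrate, a, b):
    """U = (MisRate(a) - MisRate(b)) / (b - a): the miss-rate reduction
    per extra unit of cache when an allocation grows from a to b."""
    return (misrate[a] - misrate[b]) / (b - a)

# Illustrative per-application miss rates, shared vs. dedicated cache:
missr_shr = {"app1": 0.30, "app2": 0.12}
missr_ded = {"app1": 0.10, "app2": 0.10}
m = fairness_metric(missr_shr, missr_ded, "app1", "app2")  # |3.0 - 1.2|

# Illustrative miss rate of one application vs. allocated cache units:
misrate = {1: 0.40, 2: 0.25, 3: 0.22}
u12 = utility(misrate, 1, 2)  # large gain from the second unit
u23 = utility(misrate, 2, 3)  # small gain from the third unit
# UF-ACP grants the next unit to the application with the largest U,
# subject to M_{i,j} staying below the fairness threshold thrs.
```

Here app1 suffers far more from sharing than app2 (X_1 = 3.0 vs. X_2 = 1.2, so M = 1.8), and the sample application gains much more from its second cache unit than from its third, so the second unit is where an allocator gets the most miss-rate reduction per unit spent.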
This is because such applications are influenced most strongly by changes in cache space.

The experimental results show that the Cache Shadow Directories combined with the shared cache partitioning algorithm improve throughput by up to 43.38%, and by 17.56% on average, over LRU. The multicore cache block predictor combined with the shared cache partitioning algorithm improves prediction coverage and accuracy by 47.73% and 11.69% on average in the L1 cache, and by 21.28% and 16.13% on average in the L2 cache, over the DBP proposed by Lai. It also improves throughput by up to 46.15%, and by 18.27% on average, over the traditional non-partitioned shared cache with LRU replacement, and improves fairness by a factor of 3.9 on average over LRU and by a factor of 4.0 on average over Utility-based Cache Partitioning (UCP).

Finally, the paper proposes the next research direction: perfecting the algorithms designed in this article and applying them to more complex system architectures such as cluster environments.
Keywords/Search Tags: Cache Partitioning, IPC, Fairness, Prefetch, Multicore Cache Block Predictor