Data volume is exploding with the rapid development of applications such as artificial intelligence, the Internet of Things, and big data. High-performance, low-power computing systems are urgently needed to support large-scale data processing. The processor cache bridges the cores and off-chip main memory and plays an important role in system performance and energy consumption. Compared with Static Random Access Memory (SRAM), the emerging Spin Transfer Torque Magnetic RAM (STT-MRAM) offers high density and low static power. Introducing STT-MRAM into computing systems can therefore greatly reduce cache energy consumption and increase cache capacity. However, STT-MRAM suffers from high read/write energy, long access latency, and a limited wear lifetime. It is therefore necessary to optimize overall performance and energy consumption in STT-MRAM-based cache systems. This work applies STT-MRAM caches to computing systems such as CPUs and GPUs to reduce cache energy consumption and improve system performance.

To address the high energy consumption and long access latency caused by the two-step read and write operations of multi-level cell (MLC) STT-MRAM, the one-step read and write (OSwrite) method is proposed. OSwrite aims to minimize data writes into the hard lines by compressing and encoding data and writing it only into soft lines, thereby improving system performance and reducing energy consumption. Based on the data characteristics of several workloads, OSwrite identifies a large amount of zero data and unchanged (clean) data in the Level 3 (L3) cache. Writes of clean data can therefore be avoided, and zero data can be compressed to reduce data writes. In detail, OSwrite includes cache line skipping and four cache line encoding techniques. The cache line skipping scheme bypasses data writes to all-zero or clean cache lines to reduce access latency. For the cache line encoding techniques, compression and encoding
methods are used to eliminate a large amount of zero and clean data in cache lines, avoiding data writes to hard lines and writing the data only to the soft lines, thus achieving a one-step write. For read operations, OSwrite reads data only from the soft lines based on the encoding type, achieving a one-step read. For cache lines that cannot be read and written in one step, the smart flipping encoding technique is adopted to reduce energy consumption by decreasing the two-step writes of MLC STT-MRAM. Evaluation results on the SPEC CPU 2006 workloads show that OSwrite increases the lifetime of the MLC STT-MRAM L3 cache by 49.23%, decreases dynamic energy consumption by 25.52%, reduces access latency by 28.66%, and improves system performance by 3.71%.

To address the high energy consumption and performance degradation caused by massive off-chip memory accesses in GPUs, a novel off-chip read and write access optimization method based on an STT-MRAM L2 cache (CMD) is proposed. CMD adopts STT-MRAM to increase the L2 cache capacity and reduces duplicate off-chip data read/write accesses in GPUs, thus improving system performance and reducing energy consumption. CMD analyzes the characteristics of off-chip requests and their data and classifies DRAM requests into write requests, updating read requests, and read-only requests. For write requests, because GPU applications contain abundant duplicate data, CMD eliminates these writes by mapping multiple identical data blocks to the same reference block. For updating read requests, if the requested data is duplicate data and its reference block is already in the L2 cache, the reference block is copied to the corresponding missed address in the L2 cache. Finally, for read-only requests, a small first-in-first-out buffer is added to the L2 cache to store clean and valid read-only data when it is evicted, thereby avoiding potential off-chip accesses to the block. Through extensive evaluations on a series of GPU workloads, the experimental results
show that CMD can improve performance by 53.22% and reduce cache and off-chip memory energy consumption by 52.62%. Furthermore, CMD can work with OSwrite to further improve performance by 22.42% and reduce cache and off-chip memory energy by 13.93%.

To address the high dynamic read/write energy consumption and limited lifetime of STT-MRAM caches in image processing applications, a novel approximate cache method based on STT-MRAM (APPcache+) is proposed. The method reduces the amount of data read and written by removing redundant data sub-blocks through similarity compression, thereby reducing the dynamic read/write energy consumption of the STT-MRAM cache and improving its lifetime. APPcache+ exploits the error tolerance of image processing applications and the data redundancy within cache lines through a lightweight similarity-based approximate compression algorithm that removes a large share of redundant data sub-blocks, reducing the amount of data written and lowering the write energy consumption of the STT-MRAM cache. Additionally, APPcache+ includes a partial read scheme to reduce the read energy consumption of the STT-MRAM cache. In the traditional decompression process, the entire cache line is read into the decompressor, so invalid parts of the cache line are read as well. The partial read scheme therefore divides the data into blocks and reads only the compressed part of the data blocks. Finally, to solve the problem of unbalanced bit writes caused by the compression scheme, a lightweight cache line wear leveling scheme is proposed to improve the lifetime. Extensive evaluation results on a series of image processing applications demonstrate that APPcache+ can reduce cache energy consumption by 32.58% and improve lifetime by 40.75% compared to state-of-the-art approximate compression techniques while maintaining an average application quality loss of only 1.86%.
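
The idea behind similarity-based approximate compression can be illustrated with a minimal Python sketch. It is not the APPcache+ algorithm itself: the sub-block size, the similarity test (a per-byte difference threshold), the output layout, and all names (`SUB_BLOCK_BYTES`, `THRESHOLD`, `compress`, `decompress`) are assumptions made only for illustration. The sketch keeps one representative for each group of similar sub-blocks in a cache line and replaces the others with an index, so fewer sub-blocks would need to be written.

```python
from typing import List, Tuple

SUB_BLOCK_BYTES = 8  # assumed sub-block size; not specified in the text
THRESHOLD = 4        # assumed max per-byte difference still counted as "similar"


def is_similar(a: bytes, b: bytes, threshold: int = THRESHOLD) -> bool:
    """Treat two sub-blocks as similar if every byte pair differs by <= threshold."""
    return all(abs(x - y) <= threshold for x, y in zip(a, b))


def compress(line: bytes) -> Tuple[List[bytes], List[int]]:
    """Approximate compression of one cache line (hypothetical layout).

    Returns the list of kept sub-blocks and, for every original sub-block,
    the index of the kept sub-block that stands in for it.
    """
    if len(line) % SUB_BLOCK_BYTES != 0:
        raise ValueError("line length must be a multiple of SUB_BLOCK_BYTES")
    kept: List[bytes] = []
    refs: List[int] = []
    for start in range(0, len(line), SUB_BLOCK_BYTES):
        block = line[start:start + SUB_BLOCK_BYTES]
        for idx, base in enumerate(kept):
            if is_similar(block, base):
                refs.append(idx)  # redundant sub-block: store a reference only
                break
        else:
            kept.append(block)    # new representative: must actually be stored
            refs.append(len(kept) - 1)
    return kept, refs


def decompress(kept: List[bytes], refs: List[int]) -> bytes:
    """Rebuild an approximate line by expanding each reference."""
    return b"".join(kept[i] for i in refs)


if __name__ == "__main__":
    # Four sub-blocks; the second is close to the first, the fourth equals it.
    line = bytes([10] * 8 + [12] * 8 + [200] * 8 + [10] * 8)
    kept, refs = compress(line)
    approx = decompress(kept, refs)
    max_err = max(abs(a - b) for a, b in zip(line, approx))
    print(len(kept), refs, max_err)
```

In this toy layout only the kept sub-blocks carry data, so a partial read in the same spirit would fetch `len(kept)` sub-blocks rather than the whole line. The reconstruction error is bounded by `THRESHOLD` per byte, which is the kind of bounded quality loss an error-tolerant image workload might accept.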