
Key Technologies for Improving On-chip Cache Utilization on Multi-core Systems

Posted on: 2016-10-14    Degree: Doctor    Type: Dissertation
Country: China    Candidate: S Sun    Full Text: PDF
GTID: 1228330470457949    Subject: Computer system architecture
Abstract/Summary:
Hardware caches exploit the locality of instructions and data: the instructions and data a program will need are staged in fast cache memory ahead of time to reduce access latency. Modern chip design can integrate large-capacity caches on die to ease the bandwidth and speed gap and to improve whole-system performance. As multi-core architectures and on-chip cache organizations diversify, cache design faces several new problems:

(1) Traditional single-threaded applications seriously waste hardware cache resources. In particular, on a multi-core architecture with distributed on-chip caches, a single-threaded application can use only part of the hardware cache.

(2) Cache coherence protocols become more complex, especially once multi-threading is introduced. To maintain coherence, the system must consider not only the data of a single thread but also the data exchanged between threads. The resulting protocol complexity makes coherence hard to maintain and produces many coherence misses.

(3) The latency of servicing a cache miss is higher on a multi-core architecture, because interaction and communication between cores go through the shared cache. Once multi-threading is introduced, miss handling in multi-core systems becomes more complex, so the overhead of handling a cache miss can no longer be ignored.

In addition, many cache management mechanisms, such as the on-chip cache organization, private/shared selection, the replacement policy, and cache partitioning, are tuned to the application's access characteristics and the multi-core architecture to trade off low access latency against a high hit rate.

To address these problems, this dissertation studies how to add low-cost performance monitoring units that track the shared-data access pattern of a running parallel application in real time, and how to use this run-time information to manage hardware cache resources more efficiently and to improve cache utilization for single-threaded applications, with the goal of reducing cache misses and cache access delay. The dissertation focuses on three aspects of improving on-chip cache utilization:

1. A lightweight cache management technique, Light Virtual Unified Cache Partitioning (LVUCP), built on the earlier VSCP cache management mechanism, that addresses the low cache utilization of single-threaded applications on multi-core architectures. LVUCP combines all distributed on-chip caches into one large virtual cache, so a single-threaded application can manage and use the entire on-chip cache with only a small data-spreading cost. The mechanism explicitly extends the usable cache capacity and can place high-locality data in the cache to reduce misses. Whereas parallelization maximizes the use of computing resources, LVUCP maximizes hardware cache utilization so that the needed data can be fetched faster. Experimental results show that programs using LVUCP gain significant performance improvements (57% on average) and scale well on large-scale multi-core systems (surpassing 200% in the best case). The sketch below illustrates the address-interleaving idea.
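The abstract does not give LVUCP's exact mapping, so the following C++ sketch only illustrates the general idea under stated assumptions: cache lines are interleaved across all per-core cache slices by a hash of the line address, so that a single thread's working set occupies every slice rather than only the local one. NUM_SLICES, LINE_BITS, and slice_of are hypothetical names, not taken from the dissertation.

```cpp
// Hypothetical sketch of a virtual unified cache: interleave cache lines
// across all distributed slices so one thread can use the aggregate capacity.
#include <cstdint>
#include <cstdio>

constexpr unsigned NUM_SLICES = 16;   // distributed on-chip cache slices (assumed)
constexpr unsigned LINE_BITS  = 6;    // 64-byte cache lines (assumed)

// Hash the line address so adjacent lines land in different slices.
unsigned slice_of(uint64_t addr) {
    uint64_t line = addr >> LINE_BITS;
    return static_cast<unsigned>((line ^ (line >> 7)) % NUM_SLICES);
}

int main() {
    // A single-threaded linear scan: consecutive lines map to different
    // slices, so the usable capacity is the sum of all slices.
    for (uint64_t a = 0; a < 8 * 64; a += 64)
        printf("line 0x%llx -> slice %u\n",
               static_cast<unsigned long long>(a), slice_of(a));
}
```

Hashing rather than plain modulo interleaving is one plausible way to spread strided access patterns evenly and keep the data-spreading cost low.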
2. A shared-data-aware thread scheduling mechanism that attacks the cost of thread interaction, reducing the access latency of inter-thread communication and improving overall performance. Threads interact by accessing shared data, and this interaction exhibits phase behavior. The proposed mechanism monitors the shared-data access pattern in real time and advises the system scheduler to map threads that share large amounts of data onto the same core group. This simple mapping dramatically reduces the latency of maintaining cache coherence and cuts down data replication. Results show performance gains of up to 7% and an average on-chip cache miss rate reduction of 15% compared with a traditional load-balancing thread scheduler; the grouping step is sketched below.
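As a non-authoritative sketch of the grouping step, assume the monitoring units produce a pairwise sharing matrix; a greedy pass can then co-locate the heaviest-sharing thread pairs on one core group. The matrix values, GROUP_SIZE, and the greedy pairing are illustrative assumptions, not the dissertation's actual algorithm.

```cpp
// Hypothetical sketch: group threads by observed shared-data accesses.
#include <algorithm>
#include <cstdio>
#include <tuple>
#include <vector>

int main() {
    constexpr int T = 4;             // threads (assumed)
    constexpr int GROUP_SIZE = 2;    // cores per group sharing a cache (assumed)
    // shared[i][j]: shared-line accesses observed between threads i and j
    int shared[T][T] = {{ 0, 90,  5,  2},
                        {90,  0,  3,  1},
                        { 5,  3,  0, 80},
                        { 2,  1, 80,  0}};

    std::vector<std::tuple<int, int, int>> pairs;   // (count, i, j)
    for (int i = 0; i < T; ++i)
        for (int j = i + 1; j < T; ++j)
            pairs.emplace_back(shared[i][j], i, j);
    std::sort(pairs.rbegin(), pairs.rend());        // heaviest sharing first

    std::vector<int> group(T, -1);
    int next = 0;
    for (const auto& [cnt, i, j] : pairs) {         // greedy pairing
        (void)cnt;                                  // order alone decides
        if (group[i] < 0 && group[j] < 0 && next * GROUP_SIZE < T)
            group[i] = group[j] = next++;
    }
    for (int t = 0; t < T; ++t)
        printf("thread %d -> core group %d\n", t, group[t]);
}
```

Because sharing is phased, a real scheduler would recompute this mapping periodically from fresh monitoring data rather than once.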
3. A shared-data-aware coherence transition strategy that cooperates with a directory-based MESI protocol to eliminate the remote misses incurred by the traditional write-invalidate transition strategy. During thread interaction, write-invalidate causes many coherence misses, which seriously hurt whole-system performance. The proposed mechanism improves performance by eliminating these coherence misses while a parallel application runs. A simple directory-based predictor detects cache lines that repeatedly suffer coherence misses, evidence that those shared lines are accessed frequently. For such lines, the strategy switches from write-invalidate to write-update: the remote copies are updated in place rather than invalidated, converting the next coherence miss into a hit. The work thus improves multi-threaded application performance by eliminating coherence misses and coherence traffic, and it is especially effective for parallel applications with high-frequency interaction. Results show performance gains of up to 21% over the native directory-based write-invalidate protocol. A sketch of the predictor-driven switch follows.
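A minimal sketch of the predictor-driven switch, assuming a per-line saturating counter of coherence misses that the directory consults on each write to a shared line; LinePredictor, THRESHOLD, and the counter width are hypothetical choices, not the protocol's actual design.

```cpp
// Hypothetical sketch: flip a hot line from write-invalidate to write-update.
#include <cstdint>
#include <cstdio>
#include <unordered_map>

struct LinePredictor {
    std::unordered_map<uint64_t, uint8_t> misses;  // line addr -> miss count
    static constexpr uint8_t THRESHOLD = 3;        // assumed switch point

    void record_coherence_miss(uint64_t line) {
        uint8_t& c = misses[line];
        if (c < 255) ++c;                          // saturating counter
    }
    // The directory consults this when handling a write to a shared line.
    bool use_write_update(uint64_t line) const {
        auto it = misses.find(line);
        return it != misses.end() && it->second >= THRESHOLD;
    }
};

int main() {
    LinePredictor p;
    const uint64_t hot = 0x1000;                   // a frequently shared line
    for (int i = 0; i < 4; ++i) p.record_coherence_miss(hot);
    // Above threshold: push updates to sharers instead of invalidating them,
    // so the sharers' next access hits instead of missing.
    printf("line 0x%llx -> %s\n", static_cast<unsigned long long>(hot),
           p.use_write_update(hot) ? "write-update" : "write-invalidate");
}
```

Switching only above a threshold keeps write-update traffic confined to lines that demonstrably ping-pong between cores, while rarely shared lines keep the cheaper invalidate behavior.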
This work leads to several important observations: (1) As the "memory wall" grows more severe, memory access performance is critical to both single-program execution time and whole-system throughput, and improving cache utilization is becoming more important than instruction-level optimization. (2) In parallel applications, the access pattern of shared data exhibits phase behavior, so traditional methods based on static analysis cannot profile the behavior of parallel programs. (3) Parallel threads interact by accessing shared data; the use and maintenance of shared data is therefore the major cause of low on-chip cache utilization.

Keywords/Search Tags: Multi-core architecture, Cache coherence protocol, Write-invalidate protocol, Write-update protocol, Thread scheduling, Shared data