Co-optimization Of On-chip Interconnects And Cache Coherence For Multi/Many-core Systems Based On Multithread Application Characteristics

Posted on: 2017-04-22
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Q Hu
Full Text: PDF
GTID: 1108330488491030
Subject: Signal and Information Processing
Abstract/Summary:
As the number of cores integrated in a high-performance computing system reaches 16 or more, system performance reaches the TFlops level. The cores must be interconnected to form a large-scale system, and the interconnects must coordinate accesses to distributed shared memory. The coordination between interconnects and cache coherence is the most prominent problem: cache coherence consumes substantial communication resources, increases latency, lowers the efficiency of parallel computing, and limits system scalability. Hence, data communication and storage must be considered as a whole to improve computing efficiency and overall system performance. This dissertation analyzes the data characteristics of multithreaded applications from several aspects, identifies the performance bottlenecks of multi-/many-core systems, and exposes the optimization space. The memory subsystem and interconnect subsystem are co-optimized to improve the effectiveness of data storage, coherence maintenance, and communication, so that storage overhead, latency, and energy are reduced and overall system performance is improved.

First, the characteristics of multithreaded applications are analyzed experimentally. Metrics from several aspects are profiled, including working set size, data sharing, data locality, thread-data affinity, communication traffic, and cache coherence overhead. A statistical method is employed to derive the distribution functions of certain metrics. The storage, coherence maintenance, and communication requirements of different applications are thereby well understood, which guides the optimization of the memory and interconnect subsystems and reveals the room for optimization.
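The kind of per-application profiling described above can be illustrated with a minimal sketch. The trace format, `line_size`, and the `profile_trace` helper are assumptions for illustration, not the dissertation's actual tooling; the sketch only shows how working-set size and sharing degree fall out of a memory access trace.

```python
from collections import defaultdict

def profile_trace(trace, line_size=64):
    """Profile a list of (thread_id, address) accesses.

    Returns the working-set size (number of distinct cache lines touched)
    and, per cache line, the number of distinct threads that touched it
    (the sharing degree) -- two of the metrics named in the abstract.
    """
    sharers = defaultdict(set)
    for tid, addr in trace:
        sharers[addr // line_size].add(tid)
    working_set = len(sharers)
    sharing_degree = {line: len(tids) for line, tids in sharers.items()}
    return working_set, sharing_degree

# Toy trace: threads 0 and 1 both touch line 0; thread 1 privately uses line 2.
trace = [(0, 0), (1, 16), (1, 128)]
ws, sharing = profile_trace(trace)
# ws == 2 distinct lines; line 0 has sharing degree 2, line 2 degree 1.
```

Distributions of such per-line sharing degrees, collected over many applications, are the kind of statistic the dissertation fits distribution functions to.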
This analysis lays the foundation for the rest of the research.

Second, thread-data affinity is exploited and the mapping of threads and data is optimized, based on statistical analysis of applications' traffic. The goal is to improve data access locality, reduce traffic, and lower network power. The characterization shows that thread-data affinity varies widely across applications, leading to different traffic requirements, and that the mapping of threads and data directly influences traffic. An affinity-aware thread and data mapping method is proposed: a heuristic algorithm that also takes load balance into account. To approach the minimum traffic, a simulated annealing algorithm is employed as well. Simulation of a 16-way chip multiprocessor shows that, on average, traffic is halved, network power is reduced by 42%, and system performance is improved by 9%.

Third, the on-chip interconnects and the cache coherence protocol are co-optimized, based on analysis of the coherence and communication requirements of multithreaded applications. The goal is to improve the efficiency of cache coherence and on-chip communication, lower the coherence overhead, and reduce on-chip communication latency. On-chip transmission lines are employed as a latency-optimized network and combined with a packet-switched network to form heterogeneous interconnects. Different types of messages are adaptively directed through the appropriate medium of the heterogeneous interconnects to enhance coherence effectiveness. To exploit the performance potential created by the heterogeneous interconnects, the caching and coherence strategy is tuned adaptively based on data locality profiled online.
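The simulated annealing step for mapping can be sketched as follows. This is a generic annealing loop over thread-to-core permutations, not the dissertation's exact algorithm: the communication matrix `comm`, the hop-distance matrix `dist`, the linear cooling schedule, and all function names are illustrative assumptions, and the load-balance constraint of the proposed heuristic is omitted for brevity.

```python
import math
import random

def mapping_cost(mapping, comm, dist):
    """Total traffic cost: communication volume weighted by hop distance."""
    n = len(mapping)
    return sum(comm[i][j] * dist[mapping[i]][mapping[j]]
               for i in range(n) for j in range(n))

def anneal_mapping(comm, dist, steps=5000, t0=10.0, seed=0):
    """Anneal over thread->core permutations to approach minimum traffic."""
    rng = random.Random(seed)
    n = len(comm)
    mapping = list(range(n))          # start from the identity mapping
    best, cur = mapping[:], mapping_cost(mapping, comm, dist)
    best_cost = cur
    for step in range(steps):
        t = t0 * (1 - step / steps) + 1e-6          # linear cooling
        a, b = rng.sample(range(n), 2)
        mapping[a], mapping[b] = mapping[b], mapping[a]   # swap two threads
        new = mapping_cost(mapping, comm, dist)
        if new < cur or rng.random() < math.exp(-(new - cur) / t):
            cur = new                                # accept (maybe uphill)
            if new < best_cost:
                best, best_cost = mapping[:], new
        else:
            mapping[a], mapping[b] = mapping[b], mapping[a]  # undo swap
    return best, best_cost
```

On a 4-core line with `dist[i][j] = abs(i - j)` and heavy traffic only between threads 0 and 3, annealing moves those threads onto adjacent cores, cutting the hop-weighted cost from the identity mapping's value.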
Simulation of a 16-way chip multiprocessor shows that, on average, coherence latency overhead is reduced by an order of magnitude, data accesses are twice as fast, traffic is reduced by 32%, network energy is halved, and system performance is improved by 14%.

Finally, an adaptive multi-granular tracking scheme is proposed to reduce the storage overhead of the coherence directory, based on analysis of the spatial locality and data sharing of multithreaded applications. Unlike the most basic design, in which a directory entry is allocated for each cache line, the granularity of coherence tracking is adjusted adaptively according to the run-time spatial locality of applications: consecutive cache lines of a memory region can be described by a single region entry in the directory. Combined with conventional cache-line entries, the extra coherence overhead caused by false sharing is avoided. The adaptive multi-granular directory is further applied to large-scale sixteen-socket systems, where a victim table is proposed to reduce invalidations of cache lines, improving the utilization of cache resources. Simulation results show that, in both on-chip and multi-socket systems, the storage overhead of the coherence directory is reduced by an order of magnitude while system performance is hardly affected.

In summary, this dissertation analyzes the data characteristics of various multithreaded applications and, on that basis, co-optimizes the memory and interconnect subsystems of multi-/many-core systems. The proposed co-design method improves the efficiency of data storage, coherence maintenance, and communication. The area of the directory cache, on-chip latency, and network power are all reduced, and overall system performance is improved.
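The multi-granular directory idea can be sketched as below. The region size, the `build_directory` helper, and the collapse rule (a region folds into one entry only when every line in it has the same sharer set) are illustrative assumptions, not the dissertation's exact hardware scheme; the sketch only shows why uniform-sharing regions shrink the directory while mixed-sharing regions keep per-line entries, avoiding false-sharing penalties.

```python
from collections import defaultdict

LINES_PER_REGION = 16  # assumed region size: 16 consecutive cache lines

def build_directory(sharers_by_line):
    """Build a multi-granular directory from a {line: frozenset(cores)} map.

    A region whose lines all share the same sharer set collapses to a single
    region entry; any other region falls back to per-line entries, so forcing
    coarse granularity never introduces false-sharing coherence traffic.
    """
    regions = defaultdict(dict)
    for line, sh in sharers_by_line.items():
        regions[line // LINES_PER_REGION][line] = sh
    directory = {}
    for region, lines in regions.items():
        sharer_sets = set(lines.values())
        if len(sharer_sets) == 1 and len(lines) == LINES_PER_REGION:
            directory[("region", region)] = next(iter(sharer_sets))
        else:
            for line, sh in lines.items():
                directory[("line", line)] = sh
    return directory
```

For a region private to one core, 16 line entries become one region entry, which is the source of the order-of-magnitude storage reduction reported above.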
Keywords/Search Tags: multi-/many-core system, cache coherence, directory cache, interconnects