Micro-Architectural Techniques to Alleviate Memory-Related Stalls for Transactional and Emerging Workloads

Posted on: 2017-05-30
Degree: Ph.D.
Type: Dissertation
University: University of Toronto (Canada)
Candidate: Atta, Islam
Full Text: PDF
GTID: 1468390014474164
Subject: Computer Engineering
Recent technology advances have enabled computerized services, whose proliferation has led to a tremendous increase in digital data, produced at a pace of 2.5 quintillion bytes per day. From individuals using internet-enabled mobile devices to pay for a taxi ride to businesses relying on sophisticated platforms to predict consumer behavior, many such services are hosted in data centers that incorporate hundreds or thousands of computer servers. The cost of installing, and even more so of operating, these server "farms" is significant. The performance/cost efficiency of the server machines thus dictates the cost of providing a desired level of support.

Unfortunately, modern server architectures are not well tailored for some emerging data center applications. Accordingly, this dissertation targets memory-bound applications in which memory-related stalls are a major source of execution inefficiency. Specifically, the dissertation focuses on Online Transaction Processing (OLTP) systems and on a set of emerging algorithms commonly used in so-called "Big Data" applications.

OLTP workloads are at the core of many data center applications. They are known to have large instruction footprints that foil existing first-level instruction (L1-I) caches, resulting in poor overall performance. Several proposed techniques remove some instruction stalls in exchange for error-prone instrumentation of the code base or a sharp increase in L1-I cache area and power.

This dissertation presents STREX and SLICC, two programmer-transparent, low-cost techniques that significantly reduce instruction cache misses, thereby improving the performance of OLTP workloads. Both techniques exploit repetition in the instruction reference stream within and across transactions, where a transaction prefetches the instructions for similar subsequent transactions.
STREX dynamically time-multiplexes the execution of similar transactions on a single core so that instructions fetched by one transaction are reused, as much as possible, by all other transactions executing in the system. SLICC moves transactions among multiple cores, spreading the instruction footprint over several L1-I caches and virtually increasing the cache capacity observed by transactions. Both techniques use heuristics to dynamically detect the best time to switch threads. SLICC works well with high core counts, where the aggregate L1-I cache capacity is sufficient to hold the actively accessed set of instructions; however, it performs sub-optimally or may even hurt performance when running on fewer cores. Since SLICC outperforms STREX when enough cores exist, and vice versa otherwise, this dissertation proposes a hybrid technique that combines STREX and SLICC, thereby guaranteeing maximum benefits regardless of the number of available cores and the workload's footprint. For a 16-core system, evaluation shows that SLICC and STREX respectively reduce instruction misses by 64% and 37%, resulting in overall performance gains of 67% and 49%, and energy reductions of 26% and 20%, on average.

Big Data applications have emerged to make sense of, and to extract value from, the digital data deluge. As these applications operate on large volumes of semi-structured data, they exhibit intricate, irregular, non-repetitive memory access patterns, exacerbating the effect of the much slower main memory. Worse, their irregular access streams tend to be hard to predict, stressing existing data prefetching mechanisms.

This dissertation revisits precomputation prefetching, targeting loads with long access latency, as a way to handle access patterns that are hard to predict.
It presents Ekivolos, a precomputation prefetcher system that automatically builds prefetching slices containing enough control-flow and memory-dependence instructions to faithfully and autonomously recreate the program's access behavior without incurring monitoring and execution overheads on the main thread. Ekivolos departs from the traditional notion of creating short, optimized precomputation slices and instead focuses on accuracy, showing that even longer slices can run ahead of the main thread as long as they are sufficiently accurate. Ekivolos operates on arbitrary application binaries and takes advantage of the observed execution paths in creating its slices. On a set of emerging workloads, Ekivolos is shown to outperform three state-of-the-art hardware prefetchers and a model of past dynamic precomputation-based prefetchers.
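
The intuition behind time-multiplexing similar transactions on one core, as STREX is described above, can be illustrated with a toy model. The following is a minimal sketch and not the dissertation's mechanism: it assumes a fully associative LRU cache, identical transactions that walk their instruction footprint sequentially, and a fixed phase length equal to the cache capacity, whereas the abstract states that the actual switching points are detected dynamically by heuristics. All names (`LRUCache`, `run_to_completion`, `run_time_multiplexed`) and parameter values are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Toy fully associative LRU cache of instruction blocks (an assumption)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.misses = 0

    def access(self, block):
        if block in self.blocks:
            # Hit: mark the block as most recently used.
            self.blocks.move_to_end(block)
            return
        # Miss: fetch the block and evict the least recently used one if full.
        self.misses += 1
        self.blocks[block] = True
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)


def run_to_completion(transactions, capacity):
    """Baseline: each transaction runs start to finish before the next begins."""
    cache = LRUCache(capacity)
    for trace in transactions:
        for block in trace:
            cache.access(block)
    return cache.misses


def run_time_multiplexed(transactions, capacity, phase_len):
    """Interleave transactions in fixed-length phases so that blocks fetched
    by the first transaction in a phase are reused by the others."""
    cache = LRUCache(capacity)
    longest = max(len(trace) for trace in transactions)
    for start in range(0, longest, phase_len):
        for trace in transactions:
            for block in trace[start:start + phase_len]:
                cache.access(block)
    return cache.misses


# Hypothetical workload: four identical transactions whose instruction
# footprint (64 blocks) is four times the cache capacity (16 blocks).
FOOTPRINT = 64
CAPACITY = 16
traces = [list(range(FOOTPRINT)) for _ in range(4)]

baseline = run_to_completion(traces, CAPACITY)
multiplexed = run_time_multiplexed(traces, CAPACITY, phase_len=CAPACITY)
print(f"run-to-completion misses: {baseline}")
print(f"time-multiplexed misses:  {multiplexed}")
```

In this contrived setup every access misses in the baseline (256 misses), because each transaction evicts the start of the footprint before the next one needs it, while the interleaved schedule lets only the first transaction in each phase miss (64 misses). Real OLTP transactions overlap only partially, so the numbers here say nothing about the reductions reported in the dissertation.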
Keywords/Search Tags: Emerging, Data, Workloads, Techniques, Memory, SLICC, STREX, Transaction