
Research On Key Technologies In Scalable Shared-Memory Systems

Posted on: 2020-01-25    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Y Hong    Full Text: PDF
GTID: 1368330623463989    Subject: Software engineering
Abstract/Summary:
Represented by MapReduce, graph analytics and deep learning, large-scale in-memory computing applications demand ever more computing resources. However, with the slowdown of Moore's law, applications can no longer gain scalability simply from the evolution of individual processors, and multiprocessor technology has become the industry standard. Shared memory is the crucial abstraction that current multiprocessors provide, from multi-core processors to distributed systems, and research on shared-memory systems has long been an active topic in academia. The design goal of shared-memory multiprocessor systems is to provide scalable performance to applications; in other words, software should achieve scalable speedup as more processors are added. There are two basic types of shared-memory multiprocessors. In a single-machine system, a shared-memory multiprocessor is composed of several cores connected by bus structures or high-performance interconnects; representative systems are multi-core and many-core processors. The other scenario is distributed shared memory, in which multiple servers are connected by an external network, making it even easier to scale out by adding servers. However, both the single-machine and the distributed designs face common challenges in achieving scalable performance.

The first challenge is to strike a balance between an intuitive memory consistency model and scalable performance. When multiple processors access shared data, loads and modifications of the same data inevitably happen concurrently. Without a consistency guarantee, correctness is not ensured: data loss, data corruption and concurrency bugs can occur. But a stronger consistency guarantee comes at the cost of degraded scalability, because strict memory ordering constraints leave less room for reordering and optimization. The second challenge is to implement efficient synchronization between threads. Synchronization is essential for multi-threaded applications to communicate and to coordinate shared data accesses without introducing concurrency bugs, yet synchronization algorithms impose non-trivial overhead on parallel workloads, increasing the non-parallel portion of execution and hurting scalability. The third challenge is to balance programmability with performance. To ease the burden on programmers of multi-threaded applications, shared-memory multiprocessor systems typically provide mechanisms such as fences and atomic instructions for conveying ordering semantics to the hardware. These mechanisms contribute to the correctness of applications but also incur performance overhead.
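As a minimal sketch of the fence and atomic mechanisms described above, the C11 fragment below publishes data behind a flag; the payload/ready names are illustrative and not taken from the thesis. The explicit release/acquire fences supply the ordering guarantee, and on weakly ordered hardware each fence stalls the pipeline, which is the kind of overhead discussed here.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical flag/payload pair: a producer publishes data behind a
 * flag; explicit fences supply the ordering the hardware would not
 * otherwise guarantee. */
static int        payload;
static atomic_int ready;

static void *producer(void *arg)
{
    (void)arg;
    payload = 42;                              /* plain store to shared data            */
    atomic_thread_fence(memory_order_release); /* keep payload store before flag store  */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    while (!atomic_load_explicit(&ready, memory_order_relaxed))
        ;                                      /* spin until the flag is observed       */
    atomic_thread_fence(memory_order_acquire); /* pairs with the release fence          */
    printf("payload = %d\n", payload);         /* guaranteed to print 42                */
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```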
In this thesis, we analyze the behavior of traditional parallel workloads and emerging large-scale in-memory applications to understand their memory access patterns. Combining these patterns with the hardware interface and features, we analyze the bottlenecks of performance and scalability. Aiming to improve the scalability and performance of emerging applications, we explore hardware-software cooperation that matches the characteristics of applications in both single-machine and distributed settings. The thesis makes the following contributions:

1. We explore the root cause of sequential consistency violations in single-machine shared-memory multiprocessor systems and the defects of existing fence-based ordering enforcement. We find that the current fence mechanism results in head-of-line blocking, i.e., unnecessary delay of instructions. Previous hardware-based solutions are either complicated or sub-optimal, while software-based solutions cannot fundamentally eliminate the overhead. We propose a hybrid solution that combines hardware extensions with compiler analysis to enforce sequential consistency without fences: the hardware leverages information conveyed by the static analysis to dynamically detect the hazard of violations and enforce ordering effectively. The evaluation shows that the proposed approach improves the performance of synchronization constructs by 10% and reduces the overhead of enforcing sequential consistency on SPLASH-2 and PARSEC applications from 42% to 3%.

2. We analyze the memory access patterns and synchronization of large-scale in-memory computing applications and conclude that such applications typically feature good memory access locality and coarse-grained synchronization. We revisit the design of distributed shared memory and implement a prototype on top of a fast network. We propose four optimization techniques that exploit these application characteristics by reducing the number of page faults and accelerating TLB shootdowns and protocol processing. We further propose a hybrid consistency model that offers high-performance sequential consistency together with the flexibility to override the consistency constraints with application-directed customized data accesses. The results show that our optimizations improve the performance of graph analytics applications by up to 9.25 times and significantly improve scalability.

3. We analyze low-level RDMA primitives and compare the features of one-sided and two-sided operations. Based on the dependencies between protocol operations and the best practices for each primitive, we propose an RDMA-based implementation of the distributed shared memory protocol that uses one-sided and two-sided operations where each is most suitable. With a set of optimizations such as delayed TLB shootdowns and overlapping RDMA requests, the protocol reduces processing time by up to 42% and achieves better scalability.
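To illustrate the one-sided versus two-sided distinction in the third contribution, the libibverbs sketch below posts an RDMA READ and an RDMA SEND. It is only an illustration under stated assumptions: the queue pair is already connected, the buffers are already registered, completion polling and connection setup are omitted, and the helper names are ours rather than the API of the thesis prototype.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* One-sided RDMA READ: fetch 'len' bytes from remote_addr into local_buf
 * without involving the remote CPU -- suited to reading remote data whose
 * location is already known. */
static int post_rdma_read(struct ibv_qp *qp, void *local_buf, uint32_t lkey,
                          size_t len, uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge;
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&sge, 0, sizeof(sge));
    sge.addr   = (uintptr_t)local_buf;
    sge.length = (uint32_t)len;
    sge.lkey   = lkey;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = 1;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_READ;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}

/* Two-sided SEND: the remote side must have posted a matching receive and
 * its CPU handles the message -- suited to protocol messages that require
 * processing on the remote node. */
static int post_send_msg(struct ibv_qp *qp, void *msg, uint32_t lkey, size_t len)
{
    struct ibv_sge sge;
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&sge, 0, sizeof(sge));
    sge.addr   = (uintptr_t)msg;
    sge.length = (uint32_t)len;
    sge.lkey   = lkey;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id      = 2;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_SEND;
    wr.send_flags = IBV_SEND_SIGNALED;

    return ibv_post_send(qp, &wr, &bad_wr);
}
```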
Keywords/Search Tags: Shared Memory Systems, Multicore Synchronization, Remote Direct Memory Access, Scalability