With the emergence of the Internet+, the digital society, and the digital economy, graph data, which expresses correlations between entities, has penetrated all corners of essential areas such as economic development, national security, and social life. In the Big Data era, the scale and complexity of graph data are continuously expanding, and graph processing systems often suffer from problems such as low cache hit rates and a low ratio of computation to memory access. In recent years, with the slowdown of Moore's law and Dennard scaling, these problems have become more prominent, and traditional architectures that separate storage from computation can hardly meet the growing performance and energy-efficiency requirements of graph applications. The development of graph processing therefore faces two serious challenges: the memory wall and the power wall. The emergence of processing-in-memory (PIM) devices, represented by memristors and hybrid memory cubes, has opened up new opportunities for graph processing. By integrating computation units and storage units into a single device, such memory devices break down the separation between storage and computation in traditional processor architectures and can fundamentally avoid a large amount of data movement during graph processing.

However, existing research on PIM architectures for graph processing has several limitations. First, the parallelism of the computational architecture is insufficient. With the extremely rapid expansion of graph data, the demand for parallelism in graph processing keeps increasing, and a planar memristor architecture can hardly meet the ultra-high parallelism requirements of real-world graph applications. Second, the sparse topology of real-world graphs leads to extremely low execution efficiency of graph processing workloads on regular crossbar-based architectures, making it difficult to fully exploit the computational potential of PIM architectures. Third, existing work is usually limited to simple graph algorithms and is difficult to adapt to realistic and complex graph computing scenarios. To address these issues, we propose several software-hardware co-designs as follows.

Massively parallel monolithic 3D memristor for graph processing: To overcome the limited parallelism of existing graph processing accelerators, we propose RAGra, the first 3D memristor-based graph processing accelerator, which theoretically improves the parallelism of PIM architectures. First, RAGra proposes novel mapping schemes that seamlessly and correctly guide the mapping of different graph algorithms onto the 3D memristor, exposing its inherent parallelism. Second, considering the sparsity of real-world graphs, RAGra further proposes a sparsity-aware memory subsystem that filters out invalid subgraphs to exploit the massive parallelism of the 3D memristor. Experimental results show that RAGra achieves significant performance and energy-efficiency gains over state-of-the-art graph processing accelerators with planar architectures.
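To make the tiling and filtering idea concrete, the following Python sketch illustrates, at a purely conceptual level, how an adjacency matrix might be partitioned into crossbar-sized blocks, how all-zero (invalid) blocks are filtered out as a sparsity-aware step, and how the surviving blocks could be spread across stacked layers for a parallel sparse matrix-vector product (the core of algorithms such as PageRank). The tile size, layer count, and helper names (XBAR_DIM, NUM_LAYERS, tile_and_filter) are illustrative assumptions, not RAGra's actual mapping schemes, which are not detailed here.

```python
# Conceptual sketch only: not RAGra's actual mapping scheme.
# (1) tile the adjacency matrix into crossbar-sized blocks,
# (2) filter all-zero ("invalid") blocks (sparsity-aware step),
# (3) distribute surviving blocks across stacked crossbar layers,
#     each block contributing one matrix-vector multiplication.
# All names and parameters (XBAR_DIM, NUM_LAYERS) are hypothetical.
import numpy as np

XBAR_DIM = 128    # assumed crossbar size (rows x cols)
NUM_LAYERS = 8    # assumed number of stacked memristor layers

def tile_and_filter(adj):
    """Split the adjacency matrix into XBAR_DIM x XBAR_DIM tiles and
    keep only tiles that contain at least one edge."""
    n = adj.shape[0]
    tiles = []
    for r in range(0, n, XBAR_DIM):
        for c in range(0, n, XBAR_DIM):
            block = adj[r:r + XBAR_DIM, c:c + XBAR_DIM]
            if block.any():                 # sparsity-aware filtering
                tiles.append((r, c, block))
    return tiles

def spmv_on_stacked_crossbars(adj, x):
    """Emulate one iteration of y = A @ x with valid tiles assigned
    round-robin to NUM_LAYERS layers (processed in parallel in hardware)."""
    y = np.zeros(adj.shape[0])
    tiles = tile_and_filter(adj)
    for layer in range(NUM_LAYERS):
        for r, c, block in tiles[layer::NUM_LAYERS]:
            # In hardware, each tile would be an analog crossbar MVM;
            # here it is emulated digitally.
            y[r:r + block.shape[0]] += block @ x[c:c + block.shape[1]]
    return y
```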
High-efficiency heterogeneous PIM architecture for graph processing: Real-world graph analytics workloads are often irregular. When they are mapped onto memristor crossbars, a substantial number of cells are stored and computed on but never actually used. We introduce a new heterogeneous PIM hardware design, called Hetraph, to facilitate high-efficiency graph processing. Hetraph incorporates memristor-based analog computation units for highly parallel computing and CMOS-based digital computation cores for efficient computing on a single device. To maximize hardware utilization, our software design offers a hardware-heterogeneity-aware execution model and a workload offloading mechanism. Our results show that Hetraph significantly outperforms state-of-the-art PIM-based solutions in terms of both performance and energy savings.

PIM-enabled architecture for graph learning with kernel abstraction: Existing PIM-based accelerators for graph processing are limited to classical graph algorithms, making it challenging to cope with complex graph computing applications such as graph learning. We present a new graph learning accelerator, ReFlip, with three key innovations in terms of architecture design, algorithm mapping, and practical implementation. First, ReFlip leverages PIM-featured crossbar architectures to build a unified architecture that supports the two types of graph learning kernels simultaneously. Second, ReFlip adopts novel algorithm mappings to maximize the potential performance gains reaped from the unified architecture. Third, ReFlip assembles software-hardware co-designs to process real-world graphs efficiently. Results show that ReFlip significantly outperforms state-of-the-art accelerator solutions in terms of both performance and energy efficiency.
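As a purely illustrative aid, the sketch below assumes the common decomposition of graph learning into an aggregation kernel (sparse, adjacency-driven) and a combination kernel (dense weight multiplication); the two kernel types supported by ReFlip are not named here, so this pairing is an assumption. The point of the sketch is that both kernels reduce to matrix products, which is the property that makes a single crossbar-based substrate plausible for serving both. All function names and data are hypothetical.

```python
# Conceptual sketch only: assumes a GNN-style split into aggregation and
# combination kernels, both expressible as matrix products that map to
# crossbar matrix-vector multiplications. Names and data are hypothetical.
import numpy as np

def aggregate(adj, h):
    """Aggregation kernel: gather neighbor features, i.e. a sparse
    matrix-matrix product A @ H (one crossbar MVM per feature column)."""
    return adj @ h

def combine(h_agg, w):
    """Combination kernel: dense feature transformation H_agg @ W,
    also expressible as crossbar MVMs with W programmed into the array."""
    return h_agg @ w

def gcn_layer(adj, h, w):
    """One simplified GCN-style layer built from the two kernels."""
    return np.maximum(combine(aggregate(adj, h), w), 0.0)  # ReLU

# Tiny usage example with random data (4 nodes, 3 -> 2 features).
rng = np.random.default_rng(0)
adj = (rng.random((4, 4)) < 0.5).astype(float)
h = rng.random((4, 3))
w = rng.random((3, 2))
print(gcn_layer(adj, h, w).shape)  # (4, 2)
```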