Font Size: a A A

Research On BFS Algorithm Under Multiple Architectures

Posted on:2019-04-14Degree:MasterType:Thesis
Country:ChinaCandidate:Z W PengFull Text:PDF
GTID:2428330611993633Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Breadth-First Search(BFS)is the underlying kernel algorithm for many graph applications such as social networks,medical informatics,transport systems,etc.Therefore,it has been absorbed as a core of Graph500,used to evaluate the capability of supercomputers in terms of big data processing.This paper is based on the Guangzhou Supercomputer Center in the Milky Way ?provides a platform for distributed memory architecture and shared memory architecture of BFS algorithm.First of all,we quantitatively analyze the performance bottleneck of BFS algorithm in distributed memory architecture.And put forward the optimization method of load balancing on the distributed memory architecture,the "top-down" phase synchronization is key information accessed,in "bottom-up" phase synchronization key queue information to achieve balance in each process communication times.At the same time,this paper to reduce the length of message and communication through the local marker method to reduce the number of invalid communication by way of asynchronous communication library to re package small message headers,reduce the communication of distributed memory architecture.Secondly,according to the Knights Landing(KNL)chip to optimize BFS algorithms for shared memory architecture.For the heterogeneous memory hierarchy of KNL,this paper uses KNL chip to provide the Flat model,the performance difference between MCDRAM and DRAM,the vertex data,the parent node data distribution in MCDRAM,the edge data,temporary variables and array allocation in DRAM on the way to achieve read and write data and faster memory access.At the same time,to update the cache strategy,the main work of this paper is to use Clik algorithm for fine-grained redivides,to ensure that each work is suitable for a 8 thread reads a CPU vertex cache line value,and will not update the values of each of the other effects of cache line.Finally,we evaluated the experiment in Tianhe-2 supercomputer and KNL chip.The experimental results show that the distributed memory architecture,our method can effectively shorten the communication time,compared with the top-down method,obtained about 3 times speedup.In the KNL platform,our method is effective in reducing the last level cache of load and store the number of instructions,which was about 10 times speedup in ramt.
Keywords/Search Tags:Graph500, Breadth-First Search, KNL, Communication optimization, Cache update strategy
PDF Full Text Request
Related items