Font Size: a A A

Customizing A Chip Multiprocessor For The Breadth-first Search Algorithm

Posted on:2015-12-16Degree:MasterType:Thesis
Country:ChinaCandidate:J F YangFull Text:PDF
GTID:2308330452469523Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
The high power consumption has seriously restricted the development of future highperformance computing systems. As a result, the processor design has been moreand more focused on the improvement of the performance-per-watt. The processorarchitecture customizing for some crucial applications has become one of the mostimportant technology-trends. On the other hand, the applications of graph dataprocessing have been used more and more widely in many areas. As the basis ofmany graph algorithms, the breadth first search (BFS) algorithm becomesincreasingly important.With the development of multi-/many-core processor-architectures, more and morecores will be integrated into a single chip. Thus, mapping the BFS algorithm to theChip Multiprocessor (CMP) faces the following challenges: long memoryaccess-latencies followed by the relatively simple computation, the poor data locality,load-imbalance and synchronization overheads of the parallel execution. This paperproposes a hardware/software co-optimized solution for the efficient and scalableBFS algorithm on a large scale CMP.1) In our algorithm, we divided the BFSalgorithm into two stages, the memory access part and the computing part; morethreads (or cores) has been used to run the first part that is relatively slow, which canhide access latencies and parallelize memory accesses and computation. At the sametime, we used different data-partitioning methods respectively for the memory accesspart and the computing part, to achieve the load-balance while avoidinglock-operations.2) We customized a processor core uses the local memory on-chipinstead of the last level cache to store the frequently accessed discrete data. Thismethod improves the performance and decrease the power-consumption.3) We useda fast on-chip network for the rapid data-transferring between cores.Our experiments indicate that, with the appropriate algorithm design andarchitecture support, graph data processing can be implemented efficiently on thepower efficient CMP. The simulation of a customized16-Xtensa-core CMP hasshown that our BFS algorithm can achieves90%of the performance of an IntelXeon X5560processor running the Graph500’omp-csr’ benchmark(the CMOStechnology is the same), while the chip area is10%and the TDP is1.6%of the later.As a result, the performance-per-watt increases41.98times.
Keywords/Search Tags:breadth‐first search, multiprocessor, customized processor
PDF Full Text Request
Related items