Font Size: a A A

Research And Implementation Of Performance Optimization For PowerGraph

Posted on:2020-06-26Degree:MasterType:Thesis
Country:ChinaCandidate:Z B DongFull Text:PDF
GTID:2370330596475097Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
The rapid development of the Internet has made the network graph a hot topic of research and analysis.At the same time,the technology of graph-oriented data structures such as machine learning and data mining has been widely used in fields like social network analysis,web search,natural language processing and recommendation system.PowerGraph proposes GAS(Gather Apply Scatter)abstraction and vertex-cut graph partitioning algorithm.This thesis analyzes the network communication behavior of the message transmission model and data cache mechanism during the execution of graph algorithms.It is found that the original message transmission model in PowerGraph applies only the push mode,which may result in the master replica not being able to get the latest message from the mirror replica in time,or the mirror replica may send multiple messages to the master replica.Additionally,the original data cache mechanism in PowerGraph can only reduce the computational overhead,but cannot reduce the network overhead,and developers must implement the cache-related codes by themselves,thus increasing their cost of use.To address the aforementioned major challenges,this thesis put forwards corresponding solutions respectively on the basis of a relatively exhaustive research and analysis.The main content includes the following two aspects.In order to solve the problem of message transmission model in PowerGraph,a new model is proposed as follows: Firstly,the message received by the mirror replica is always stored locally and will be merged with the newly received messages until the mirror replica is scheduled to send the merged message to the master replica;Secondly,when the master replica begins a new iteration,it sends a request to each mirror replica,searches and retrieves the message that has been generated but not yet transferred to the master replica.The new hybrid push-pull message model enables the master replica to get the latest messages in the mirror replicas in time,avoiding the need to perform additional iterations,while enabling the mirror replicas to merge messages as many as possible,reducing the network overhead of transmitting messages.To compensate for the flaws brought by the data cache mechanism in PowerGraph,this thesis proposes a technique for accelerating iterations by setting caches at the master replica.Specifically,the cache corresponding to each replica of the vertex is to be stored locally on the master replica.In this way,if the master replica performs the Gather phase,and meanwhile the corresponding cache of mirror replica is valid,it is unnecessary for the master replica to retrieve data from the node where the mirror replica resides across the network.As a result,the network overhead in the process of Gather can be cut down.This thesis realizes the optimization mechanisms mentioned previously.The improved framework achieves the complete transparency of the upper applications,so that all existing applications can be executed in the new system without any modification.The applications in PowerGraph's own toolset are selected to compare the number of communication times,computation times and execution time of the new system and the original system under different applications,datasets and machine numbers.Finally,the experimental results illustrate that the proposed scheme can deliver speedups of 167% to 271% for different applications.
Keywords/Search Tags:distributed system, graph computing, PowerGraph, message transmission, data cache
PDF Full Text Request
Related items