
Research On Parallelization Of Community Discovery Algorithm Based On Hadoop

Posted on: 2018-02-19
Degree: Master
Type: Thesis
Country: China
Candidate: B Z Lai
Full Text: PDF
GTID: 2348330518961609
Subject: Computer technology
Abstract/Summary:
Community discovery algorithms are the main tools for studying community structure in complex networks. With the explosive growth of network scale, traditional serial, single-machine community discovery algorithms are no longer suitable for large-scale networks. As an emerging big-data processing technology, Hadoop is favored by many developers for its high scalability, high reliability, and simple programming model. To address the limitations of current serial community discovery algorithms on large-scale networks, this thesis proposes two parallel community discovery algorithms that exploit the strengths of the Hadoop framework in big-data processing.

To address the high complexity of the Fast-Newman algorithm, this thesis first proposes a parallel Fast-Newman algorithm based on the Hadoop framework. The main difficulty in parallelizing Fast-Newman is that, inside the map function, a node cannot obtain information about its neighbor nodes, so global graph information must be supplied through a Web service. The improved algorithm computes in parallel, in the map function, the modularity increment that would result from merging each node with each of its neighbors. The reduce function then finds the pair of nodes with the largest modularity increment and merges them; this constitutes one merge step. The merged result is fed back as the input of the map function, and the map and reduce phases are iterated until all nodes have been merged into a single community. Experiments on the Ego-Facebook dataset show that the parallel Fast-Newman algorithm achieves a high speedup.

To handle large-scale networks, this thesis further proposes PFU (Parallel Fast-Unfolding), a parallel Fast-Unfolding algorithm based on Hadoop. The algorithm follows a divide-and-conquer strategy: the large-scale network is first partitioned and each partition is merged independently; the network is then reconstructed according to the result of each partition; finally, the reconstructed network is merged repeatedly until the community structure no longer changes. The parallelization scheme faces two difficulties: how to ensure that edge-connection information is not lost after partitioning, and how to reconstruct the network once partitioning is complete. This thesis solves these problems by improving the data storage format and designing a reconstruction scheme. Experimental results show that, while preserving accuracy, the PFU algorithm significantly improves efficiency and exhibits good scalability. Finally, the intermediate files output by the map and reduce phases are visualized with the Gephi software, which improves the practical value of the PFU algorithm.
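To make the map/reduce division of labour concrete, the following is a minimal, single-process Python sketch of one parallel Fast-Newman merge step. It is an illustration only, not the thesis implementation: the function names and in-memory data layout are assumptions, and the real system runs the map and reduce phases on Hadoop with global graph information served via a Web service. The map phase emits, for every pair of adjacent communities, the modularity increment dQ = 2(e_ij - a_i * a_j); the reduce phase selects the pair with the largest increment and merges it.

```python
from collections import defaultdict

def map_phase(edges, community):
    """Map: for each pair of adjacent communities, emit the modularity
    increment dQ = 2*(e_ij - a_i*a_j) that merging them would produce."""
    m = float(len(edges))
    e = defaultdict(float)   # e[(i, j)]: fraction of edges between communities i and j
    a = defaultdict(float)   # a[i]: fraction of edge endpoints inside community i
    for u, v in edges:
        cu, cv = community[u], community[v]
        e[tuple(sorted((cu, cv)))] += 1.0 / m
        a[cu] += 0.5 / m
        a[cv] += 0.5 / m
    return [((i, j), 2.0 * (e[(i, j)] - a[i] * a[j]))
            for (i, j) in e if i != j]

def reduce_phase(candidates):
    """Reduce: pick the community pair with the largest increment."""
    return max(candidates, key=lambda kv: kv[1])[0]

def merge_step(edges, community):
    """One merge iteration: map, reduce, then relabel community j as i."""
    i, j = reduce_phase(map_phase(edges, community))
    return {n: (i if c == j else c) for n, c in community.items()}
```

In the full algorithm, `merge_step` would be iterated until a single community remains, while tracking the partition that attains the best overall modularity along the way.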
Keywords/Search Tags: community discovery, Hadoop, large-scale network, parallelization, visualization