Font Size: a A A

Design And Implementation Of PageRank Computing System Based On MapReduce

Posted on:2012-02-21Degree:MasterType:Thesis
Country:ChinaCandidate:L XiaFull Text:PDF
GTID:2268330425991617Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
With the advent of information age, all kinds of information grow explosively, and resulting in the size of graph is increasing which is larger than before. PageRank and the analysis of social network increasingly are becoming hot topics. PageRank, for example, we must store10billion vetices, if each vetex’s storage is about100bytes, then the storage will reach about900G. Such large of network data size not only will give us challenges to deal with them effectively, but also there are some opportunities. Dealing with so large graph data effectively has become the urgent needs for economic and social development.According to the background above, this thesis takes PageRank as an example, constructing a computing system based on MapReduce which is the sub-project of Hadoop. The main jobs are as following:First, data crawling, we use Heritrix as a tool to crawl web pages. So we get nodes’ URLs and the relationship among URLs.Secondly, as we know, if we transfer the whole node’s information between PageRank computing, that is terrible. So, in data processing module, we build adjacency list to represent node’s URL and the relationship between URL and URL.Thirdly, in PageRank computing module, we use the data from adjacency list to compute PageRank for each node. We proposed three methods to compute PageRank, which are the native PageRank computing algorithm which converts inlinks to outlinks, called Native-PR, the one based on only one job in each iteration, called OIOJ-PR, another one based on sub-graph partitioning, called SGPB-PR. The Native-PR algorithm based on conversion of between inlinks and outlinks is not efficient because of starting two jobs in a PageRank calculation courses. Computing PageRank based on only one job is more efficient than the conversion of between inlinks and outlinks attributed to only one job during the procedure of computing PageRank and combinering in local. The SGPB-PR algorithm partitions the graph node’s information based on URL into several sub-graphs, and each Map function is responsible for a sub-graph.Finally, we use prefuse, which is an open source tool to show the node’s information visiualized. As prefuse does not provide HDFS interface, we design the procedure to load graph data from HDFS to database.
Keywords/Search Tags:MapReduce, data crawling, adjacency list, PageRank, data visualization
PDF Full Text Request
Related items