Design And Implementation Of PageRank Computing System Based On MapReduce

Posted on:2012-02-21

Degree:Master

Type:Thesis

Country:China

Candidate:L Xia

Full Text:PDF

GTID:2268330425991617

Subject:Computer software and theory

Abstract/Summary:

PDF Full Text Request

With the advent of information age, all kinds of information grow explosively, and resulting in the size of graph is increasing which is larger than before. PageRank and the analysis of social network increasingly are becoming hot topics. PageRank, for example, we must store10billion vetices, if each vetex’s storage is about100bytes, then the storage will reach about900G. Such large of network data size not only will give us challenges to deal with them effectively, but also there are some opportunities. Dealing with so large graph data effectively has become the urgent needs for economic and social development.According to the background above, this thesis takes PageRank as an example, constructing a computing system based on MapReduce which is the sub-project of Hadoop. The main jobs are as following:First, data crawling, we use Heritrix as a tool to crawl web pages. So we get nodes’ URLs and the relationship among URLs.Secondly, as we know, if we transfer the whole node’s information between PageRank computing, that is terrible. So, in data processing module, we build adjacency list to represent node’s URL and the relationship between URL and URL.Thirdly, in PageRank computing module, we use the data from adjacency list to compute PageRank for each node. We proposed three methods to compute PageRank, which are the native PageRank computing algorithm which converts inlinks to outlinks, called Native-PR, the one based on only one job in each iteration, called OIOJ-PR, another one based on sub-graph partitioning, called SGPB-PR. The Native-PR algorithm based on conversion of between inlinks and outlinks is not efficient because of starting two jobs in a PageRank calculation courses. Computing PageRank based on only one job is more efficient than the conversion of between inlinks and outlinks attributed to only one job during the procedure of computing PageRank and combinering in local. The SGPB-PR algorithm partitions the graph node’s information based on URL into several sub-graphs, and each Map function is responsible for a sub-graph.Finally, we use prefuse, which is an open source tool to show the node’s information visiualized. As prefuse does not provide HDFS interface, we design the procedure to load graph data from HDFS to database.

Keywords/Search Tags:

MapReduce, data crawling, adjacency list, PageRank, data visualization

PDF Full Text Request

Related items

1	The Study And Implementation Of Efficient And Stable Methods For Data Crawling In Vertical Search Engines
2	Research On Causal Consistency Model Based On Grouping Strategy And Adjacency List
3	Design And Implementation Of Visualization System For Movie Website Data Mining
4	The Research Of Data Warehouse ETL Distribution And Scheduling Model
5	The Research Of Data Warehouse Etl Distribution And Scheduling Model
6	Optimization And Implementation Of PageRank Using MapReduce
7	The Research And Application Of Collaborative Filtering Algorithm With Big Data Visualization Model
8	Research And Implementation Of Multidimensional Data Visualization Tool Based On Web
9	A Research On Personalized PageRank Based On MapReduce
10	Research On Spatiotemporal Data-oriented Mining Algorithms