
Optimization Of PageRank Algorithm On Hadoop

Posted on: 2014-08-26
Degree: Master
Type: Thesis
Country: China
Candidate: S B Liao
Full Text: PDF
GTID: 2208330434471010
Subject: Computer software and theory
Abstract/Summary:
As social networks and semantic networks have sprung up in recent years, massive data mining has drawn the attention of researchers and engineers. A single server has neither enough storage capacity nor enough computing capability for massive data analysis. The open-source distributed framework Hadoop, developed and maintained by the Apache Software Foundation, has been widely used in big-data products and applications.

In a traditional data mining implementation, the source data is stored on local disk; the data mining program reads it from disk into in-memory data structures, computes efficiently with the help of indexes, and finally writes the output to local disk, the console, or a remote client. For a program on a single server, the developer only needs to consider matters such as correctness, space/time complexity, the choice of data structures, and how to visualize the result.

As data volumes grow, disk capacity is no longer adequate for storage, and memory capacity is no longer adequate for the data during computation. Traditional algorithms therefore need to be transformed into distributed algorithms that exploit a cluster of servers. During this transformation, many concerns arise, such as data distribution, the network between nodes, and fault tolerance. On the Hadoop distributed computing platform, the developer can focus on the algorithm itself, while Hadoop takes charge of networking, fault tolerance, and so on.

When an algorithm is transformed into its Hadoop version, it takes the form of Map (local computing and data redistribution) → network transmission → Reduce (result collection and aggregation). Transforming data mining algorithms onto the Hadoop platform is a challenge for both academia and industry: how should massive data be distributed, how should the execution unit for key and value be chosen, and how can the network cost be reduced?

PageRank is a very popular web page ranking algorithm, named after Larry Page and used by the Google web search engine; it assigns each web page a score indicating its importance within the web. With the development of the Internet, the number of web pages grows exponentially, beyond the capacity of a single server. For scalability, people hope to transform the PageRank algorithm onto the Hadoop distributed platform, so that when the data volume increases, servers can simply be added to the cluster dynamically.

Experiments show, however, that while the Hadoop version of PageRank meets the demand of scalability, its performance is poor. Given this problem, this paper proposes an optimized PageRank algorithm on Hadoop. The core idea of the optimization is to change the granularity of the MapReduce key and value by means of graph clustering, which reduces the network cost between Map and Reduce, balances computing resources, and improves overall efficiency. Since PageRank is used not only for ranking web pages but also for computing node importance in other graphs, the optimization may not be applicable if the graph itself is very sparse or the clustering is ineffective. We therefore build a naive cost model to check whether the optimization works; if it does not, the original PageRank algorithm is used.

This paper explains how to implement and optimize an iterative algorithm like PageRank on the Hadoop distributed platform, and proposes the idea of changing the unit of MapReduce from a node to a subgraph by graph partitioning, in order to reduce the network cost between Map and Reduce, balance computing resources, and increase overall performance.
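The abstract includes no code; as a minimal sketch of the node-granularity Map → network transmission → Reduce form described above, one PageRank iteration on Hadoop might look like the following. The input format (one node per line as "nodeId, rank, comma-separated adjacency list"), the class names, and the LINKS marker are assumptions for illustration, not the thesis's actual implementation.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PageRankIteration {

  // Map: one record per node, "nodeId<TAB>rank<TAB>adj1,adj2,...".
  // Re-emits the node's adjacency list (so the reducer can rebuild the
  // record) and sends a contribution rank/outDegree to each out-neighbor.
  public static class PageRankMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t");
      String node = parts[0];
      double rank = Double.parseDouble(parts[1]);
      String adj = parts.length > 2 ? parts[2] : "";
      ctx.write(new Text(node), new Text("LINKS\t" + adj));
      if (!adj.isEmpty()) {
        String[] targets = adj.split(",");
        for (String t : targets) {
          ctx.write(new Text(t), new Text(Double.toString(rank / targets.length)));
        }
      }
    }
  }

  // Reduce: sums the incoming contributions for a node and applies the
  // damping factor in the common un-normalized form (1 - d) + d * sum.
  public static class PageRankReducer extends Reducer<Text, Text, Text, Text> {
    private static final double D = 0.85; // standard damping factor

    @Override
    protected void reduce(Text node, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      double sum = 0.0;
      String adj = "";
      for (Text v : values) {
        String s = v.toString();
        if (s.startsWith("LINKS\t")) {
          adj = s.substring(6); // recover the adjacency list
        } else {
          sum += Double.parseDouble(s);
        }
      }
      double newRank = (1.0 - D) + D * sum;
      ctx.write(node, new Text(newRank + "\t" + adj)); // same format as input
    }
  }
}

Each iteration is one such job chained on the previous job's output. At node granularity, every edge produces one shuffled (key, value) pair, which is exactly the network cost the thesis's subgraph-granularity optimization targets.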
Beyond PageRank itself, the optimization provides a new idea applicable to other iterative graph mining algorithms on the Hadoop platform.
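The abstract calls its cost model "naive" without giving its form. The following is a hypothetical sketch of such a check, assuming the dominant cost is the number of (key, value) pairs shuffled between Map and Reduce: node granularity shuffles roughly one pair per edge, while subgraph granularity shuffles pairs only for cross-partition edges. The class name, parameters, and threshold are all illustrative assumptions, not the thesis's actual model.

// Hypothetical cost check: approximate the shuffle volume by the number of
// (key, value) pairs sent from Map to Reduce. Node granularity emits roughly
// one pair per edge; subgraph granularity emits pairs only for edges crossing
// partition boundaries, at the price of extra local computation per subgraph.
public final class PartitionCostCheck {

  private PartitionCostCheck() {}

  /**
   * Decides whether the subgraph-granularity PageRank is expected to pay off.
   *
   * @param totalEdges          number of directed edges in the graph
   * @param crossPartitionEdges edges whose endpoints fall in different subgraphs
   * @param threshold           tunable cutoff in (0, 1); e.g. 0.5 requires the
   *                            partitioning to at least halve the shuffle
   * @return true if the optimized variant should be used
   */
  public static boolean useSubgraphVariant(long totalEdges,
                                           long crossPartitionEdges,
                                           double threshold) {
    if (totalEdges == 0) {
      return false; // degenerate graph: nothing to shuffle either way
    }
    double shuffleRatio = (double) crossPartitionEdges / totalEdges;
    // A very sparse or badly clustered graph keeps this ratio near 1.0,
    // so the check falls back to the original node-level PageRank.
    return shuffleRatio < threshold;
  }
}

If the check fails, the job runs the original node-granularity PageRank, matching the fallback described in the abstract.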
Keywords/Search Tags: Hadoop, PageRank, graph partitioning