Optimization And Implementation Of PageRank Using MapReduce

Posted on: 2017-05-19
Degree: Master
Type: Thesis
Country: China
Candidate: D X Meng
Full Text: PDF
GTID: 2308330491450802
Subject: Data mining
Abstract/Summary:
With the surge of Internet data, analysis and information mining over huge data volumes run into bottlenecks in computing power and storage space. MapReduce is a programming model for processing parallelizable problems across huge datasets with a parallel, distributed algorithm on a cluster, and it can effectively address the problems that arise when dealing with massive data. The model combines grid computing, parallel computing, distributed computing and related technologies, which both lowers the requirements placed on terminal equipment and improves data processing capacity. Addressing the shortcomings of the classic PageRank page ranking algorithm, which is based on link relations, this paper optimizes the web page ranking algorithm and designs an optimized algorithm suited to the MapReduce distributed computing model. The main work of this paper is as follows:
(1) Web structure mining theory is analyzed in depth, and the PageRank, HITS, and SALSA algorithms are analyzed and compared with each other.
(2) To address four disadvantages of the PageRank algorithm (topic drift, even distribution of weight among links, bias toward older pages, and independence from user interest), the optimized algorithm I-PR is proposed as the corresponding solution. Its superiority in ranking web pages is verified by experiments.
(3) An I-PR algorithm for the MapReduce computing model is designed to solve the problems the traditional PageRank algorithm faces under serial processing, namely low efficiency and storage difficulty. Experiments have been conducted on the Hadoop framework.
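To illustrate how a link-based ranking computation fits the MapReduce model, the following is a minimal sketch of one iteration of classic PageRank expressed as a map step and a reduce step, simulated in memory in plain Python. It is not the I-PR algorithm, whose details the abstract does not give, and it does not use the Hadoop API. The function names, the damping factor of 0.85, and the sample graph are assumptions made for illustration. Dangling pages (pages with no outgoing links) are not handled.

```python
from collections import defaultdict

DAMPING = 0.85  # assumed damping factor; a commonly used value, not taken from the thesis


def map_phase(page, rank, outlinks):
    """Emit the page's link list plus an equal share of its rank to each target."""
    # Pass the graph structure through so the reducer can forward it.
    yield page, ("links", outlinks)
    if outlinks:
        share = rank / len(outlinks)
        for target in outlinks:
            yield target, ("rank", share)


def reduce_phase(page, values, num_pages):
    """Sum the rank shares received by a page and apply the damping formula."""
    outlinks = []
    total = 0.0
    for kind, payload in values:
        if kind == "links":
            outlinks = payload
        else:
            total += payload
    new_rank = (1.0 - DAMPING) / num_pages + DAMPING * total
    return page, new_rank, outlinks


def pagerank_iteration(graph, ranks):
    """Run one map -> shuffle -> reduce round over an in-memory graph."""
    num_pages = len(graph)
    shuffled = defaultdict(list)  # stands in for the framework's shuffle/sort step
    for page, outlinks in graph.items():
        for key, value in map_phase(page, ranks[page], outlinks):
            shuffled[key].append(value)
    new_ranks = {}
    for page, values in shuffled.items():
        _, new_rank, _ = reduce_phase(page, values, num_pages)
        new_ranks[page] = new_rank
    return new_ranks


def pagerank(graph, iterations=20):
    """Iterate from a uniform start; assumes every link target is a key of graph."""
    ranks = {page: 1.0 / len(graph) for page in graph}
    for _ in range(iterations):
        ranks = pagerank_iteration(graph, ranks)
    return ranks


if __name__ == "__main__":
    # Hypothetical three-page graph: A links to B and C, B to C, C to A.
    sample = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    for page, rank in sorted(pagerank(sample).items()):
        print(page, round(rank, 4))
```

In a real Hadoop job each iteration would be a separate MapReduce pass whose output feeds the next, which is why the mapper re-emits each page's link list alongside the rank shares.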
Keywords/Search Tags: Link analysis algorithm, Hadoop, MapReduce, PageRank