
Research And Optimization Of Distributed Crawler System Based On Nutch

Posted on: 2016-11-04
Degree: Master
Type: Thesis
Country: China
Candidate: D Jing
Full Text: PDF
GTID: 2428330542457286
Subject: Computer technology
Abstract/Summary:
With the rapid development of Internet technology, the world now produces enormous amounts of data every day, and the key technologies behind big data have been evolving accordingly. Cloud computing has become a research hotspot in both the computer industry and academia. Hadoop, with its good scalability and reliability, has become a widely used cloud platform and has attracted the attention of many researchers. Nutch is an open-source search engine developed in Java; it supports distributed crawling and runs on Hadoop as its underlying platform, and more and more scholars are studying ways to improve the efficiency of distributed search.

The thesis first analyses the Hadoop platform and the Nutch framework, including Nutch's plugin and indexing mechanisms, HDFS, and the Map/Reduce computing model, and reviews several common web page deduplication and ranking algorithms. On this basis, to address the deficiencies of native Nutch in page deduplication and ranking, the thesis proposes a web page deduplication algorithm based on weighted feature word extraction and a PageRank algorithm based on the importance of page document fingerprints. The deduplication algorithm uses a weighting method to extract web page content, applies the SimHash algorithm to represent each document as a set of characteristic fingerprints, and then computes the Jaccard coefficient over these fingerprint sets to decide whether two pages are similar. The improved PageRank algorithm uses document fingerprints to measure the topic similarity between web pages and allocates PageRank values according to each page's number of inbound links, mitigating the topic drift and uniform weight allocation problems of the traditional PageRank algorithm. The thesis also provides Map/Reduce implementations of both algorithms within the system.

Finally, we build a Hadoop and Nutch experimental environment and evaluate the two algorithms. The experimental results show that the duplicate removal algorithm strikes a good balance between deduplication effectiveness and time efficiency, and that the ranking algorithm achieves higher precision and stability than the traditional PageRank algorithm.
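
To make the deduplication idea above concrete, the following Java sketch shows one possible shape of the computation: each page's weighted feature words are hashed, a 64-bit SimHash fingerprint is built from the weighted bit votes, and two pages are compared by the Jaccard coefficient of their per-feature fingerprint sets. The FNV-1a hash, the 64-bit width, and the example weights are illustrative assumptions and do not reproduce the thesis's exact weighting scheme or its Map/Reduce packaging.

import java.util.*;

/** A minimal sketch of weighted-feature fingerprinting for page deduplication (assumed details). */
public class WeightedFingerprintDedup {

    // Build a 64-bit SimHash fingerprint: each feature word votes on every bit
    // with its weight, positive if that bit of its hash is 1, negative otherwise.
    static long simHash(Map<String, Double> weightedFeatures) {
        double[] votes = new double[64];
        for (Map.Entry<String, Double> e : weightedFeatures.entrySet()) {
            long h = fnv1a64(e.getKey());
            for (int bit = 0; bit < 64; bit++) {
                votes[bit] += ((h >>> bit) & 1L) == 1L ? e.getValue() : -e.getValue();
            }
        }
        long fingerprint = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) fingerprint |= (1L << bit);
        }
        return fingerprint;
    }

    // Jaccard coefficient of two fingerprint sets: |A intersect B| / |A union B|.
    static double jaccard(Set<Long> a, Set<Long> b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        Set<Long> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<Long> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    // FNV-1a 64-bit string hash, a stand-in for whatever hash the thesis uses.
    static long fnv1a64(String s) {
        long hash = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            hash ^= s.charAt(i);
            hash *= 0x100000001b3L;
        }
        return hash;
    }

    public static void main(String[] args) {
        // Hypothetical pages represented as feature words with term weights.
        Map<String, Double> pageA = Map.of("hadoop", 3.0, "nutch", 2.0, "crawler", 1.5);
        Map<String, Double> pageB = Map.of("hadoop", 2.5, "nutch", 2.0, "index", 1.0);
        Set<Long> fpA = new HashSet<>();
        Set<Long> fpB = new HashSet<>();
        for (String w : pageA.keySet()) fpA.add(fnv1a64(w));
        for (String w : pageB.keySet()) fpB.add(fnv1a64(w));
        System.out.println("SimHash of page A: " + Long.toHexString(simHash(pageA)));
        System.out.println("Jaccard(A, B) = " + jaccard(fpA, fpB));
        // A page pair would be flagged as a near-duplicate when the coefficient
        // exceeds a tuned threshold (the threshold value is an assumption).
    }
}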
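
Likewise, a rough sketch of the fingerprint-aware PageRank update is given below, assuming that a page's rank is distributed to its outlinks in proportion to the fingerprint (topic) similarity between the source and each target, instead of the uniform 1/outDegree share of classic PageRank. The damping factor of 0.85, the data structures, and the treatment of pages with no similar outlinks are assumptions; the thesis additionally weights the allocation by the targets' inbound-link counts, which this sketch omits.

import java.util.*;

/** A sketch of one similarity-weighted PageRank iteration (assumed formulation). */
public class TopicPageRankStep {

    static final double DAMPING = 0.85; // standard damping factor, assumed here

    /**
     * One synchronous update. ranks holds the current PageRank per page,
     * outlinks is the adjacency list, and similarity maps a source page to the
     * topic similarity of each of its outlink targets (for example the Jaccard
     * coefficient over their fingerprint sets). Assumes every link target also
     * appears in ranks.
     */
    static Map<String, Double> step(Map<String, Double> ranks,
                                    Map<String, List<String>> outlinks,
                                    Map<String, Map<String, Double>> similarity) {
        Map<String, Double> next = new HashMap<>();
        for (String page : ranks.keySet()) {
            next.put(page, 1.0 - DAMPING); // teleport term
        }
        for (Map.Entry<String, List<String>> e : outlinks.entrySet()) {
            String src = e.getKey();
            double rank = ranks.getOrDefault(src, 0.0);
            Map<String, Double> sims = similarity.getOrDefault(src, Collections.emptyMap());
            double total = 0.0;
            for (String dst : e.getValue()) {
                total += sims.getOrDefault(dst, 0.0);
            }
            if (total == 0.0) {
                continue; // no topically similar outlinks: contribute nothing (an assumption)
            }
            for (String dst : e.getValue()) {
                double share = sims.getOrDefault(dst, 0.0) / total;
                next.merge(dst, DAMPING * rank * share, Double::sum);
            }
        }
        return next;
    }
}

In the thesis itself this update would presumably be expressed as a Map/Reduce job, with map tasks emitting (target page, rank contribution) pairs and reduce tasks summing the contributions for each page, in line with the Map/Reduce implementations mentioned above.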
Keywords/Search Tags: Distributed Crawler System, Nutch, Web Page Deduplication, Web Page Ranking, Page Fingerprint