
Research And Optimization Of Distributed Crawler System Based On Nutch

Posted on: 2016-11-04
Degree: Master
Type: Thesis
Country: China
Candidate: D Jing
Full Text: PDF
GTID: 2428330542457286
Subject: Computer technology
Abstract/Summary:
With the rapid development of Internet technology, the world now produces enormous amounts of data every day, and the key technologies behind big data have been evolving accordingly. Cloud computing has become a research hotspot in both the computer industry and academia. Hadoop, with its good scalability and reliability, has become a widely used cloud platform and has attracted the attention of many researchers. Nutch is an open-source search engine developed in Java; it supports distributed crawling and runs on Hadoop as its underlying platform, and more and more scholars are studying ways to improve the efficiency of distributed search.

The thesis first analyses the Hadoop platform and the Nutch framework, including Nutch's plugin and indexing mechanisms, HDFS, and the Map/Reduce computing model, and reviews several common web page deduplication and ranking algorithms. On this basis, to address the deficiencies of native Nutch in page deduplication and ranking, the thesis proposes a web page deduplication algorithm based on weighted feature word extraction and a PageRank algorithm based on the importance of page document fingerprints. The deduplication algorithm uses a weighting method to extract web page content, applies the SimHash algorithm to represent each document as a set of characteristic fingerprints, and then computes the Jaccard coefficient over these fingerprint sets to decide whether two pages are similar. The improved PageRank algorithm uses document fingerprints to measure the topic similarity between web pages and allocates PageRank values according to each page's number of inbound links, mitigating the topic drift and uniform weight allocation problems of the traditional PageRank algorithm. The thesis also provides Map/Reduce implementations of both algorithms within the system.

Finally, we build a Hadoop and Nutch experimental environment and evaluate the two algorithms. The experimental results show that the duplicate removal algorithm strikes a good balance between deduplication effectiveness and time efficiency, and that the ranking algorithm achieves higher precision and stability than the traditional PageRank algorithm.
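
To make the deduplication idea above concrete, the following Java sketch shows one possible shape of the computation: each page's weighted feature words are hashed, a 64-bit SimHash fingerprint is built from the weighted bit votes, and two pages are compared by the Jaccard coefficient of their per-feature fingerprint sets. The FNV-1a hash, the 64-bit width, and the example weights are illustrative assumptions and do not reproduce the thesis's exact weighting scheme or its Map/Reduce packaging.

import java.util.*;

/** A minimal sketch of weighted-feature fingerprinting for page deduplication (assumed details). */
public class WeightedFingerprintDedup {

    // Build a 64-bit SimHash fingerprint: each feature word votes on every bit
    // with its weight, positive if that bit of its hash is 1, negative otherwise.
    static long simHash(Map<String, Double> weightedFeatures) {
        double[] votes = new double[64];
        for (Map.Entry<String, Double> e : weightedFeatures.entrySet()) {
            long h = fnv1a64(e.getKey());
            for (int bit = 0; bit < 64; bit++) {
                votes[bit] += ((h >>> bit) & 1L) == 1L ? e.getValue() : -e.getValue();
            }
        }
        long fingerprint = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) fingerprint |= (1L << bit);
        }
        return fingerprint;
    }

    // Jaccard coefficient of two fingerprint sets: |A intersect B| / |A union B|.
    static double jaccard(Set<Long> a, Set<Long> b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        Set<Long> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<Long> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    // FNV-1a 64-bit string hash, a stand-in for whatever hash the thesis uses.
    static long fnv1a64(String s) {
        long hash = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            hash ^= s.charAt(i);
            hash *= 0x100000001b3L;
        }
        return hash;
    }

    public static void main(String[] args) {
        // Hypothetical pages represented as feature words with term weights.
        Map<String, Double> pageA = Map.of("hadoop", 3.0, "nutch", 2.0, "crawler", 1.5);
        Map<String, Double> pageB = Map.of("hadoop", 2.5, "nutch", 2.0, "index", 1.0);
        Set<Long> fpA = new HashSet<>();
        Set<Long> fpB = new HashSet<>();
        for (String w : pageA.keySet()) fpA.add(fnv1a64(w));
        for (String w : pageB.keySet()) fpB.add(fnv1a64(w));
        System.out.println("SimHash of page A: " + Long.toHexString(simHash(pageA)));
        System.out.println("Jaccard(A, B) = " + jaccard(fpA, fpB));
        // A page pair would be flagged as a near-duplicate when the coefficient
        // exceeds a tuned threshold (the threshold value is an assumption).
    }
}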
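
Likewise, a rough sketch of the fingerprint-aware PageRank update is given below, assuming that a page's rank is distributed to its outlinks in proportion to the fingerprint (topic) similarity between the source and each target, instead of the uniform 1/outDegree share of classic PageRank. The damping factor of 0.85, the data structures, and the treatment of pages with no similar outlinks are assumptions; the thesis additionally weights the allocation by the targets' inbound-link counts, which this sketch omits.

import java.util.*;

/** A sketch of one similarity-weighted PageRank iteration (assumed formulation). */
public class TopicPageRankStep {

    static final double DAMPING = 0.85; // standard damping factor, assumed here

    /**
     * One synchronous update. ranks holds the current PageRank per page,
     * outlinks is the adjacency list, and similarity maps a source page to the
     * topic similarity of each of its outlink targets (for example the Jaccard
     * coefficient over their fingerprint sets). Assumes every link target also
     * appears in ranks.
     */
    static Map<String, Double> step(Map<String, Double> ranks,
                                    Map<String, List<String>> outlinks,
                                    Map<String, Map<String, Double>> similarity) {
        Map<String, Double> next = new HashMap<>();
        for (String page : ranks.keySet()) {
            next.put(page, 1.0 - DAMPING); // teleport term
        }
        for (Map.Entry<String, List<String>> e : outlinks.entrySet()) {
            String src = e.getKey();
            double rank = ranks.getOrDefault(src, 0.0);
            Map<String, Double> sims = similarity.getOrDefault(src, Collections.emptyMap());
            double total = 0.0;
            for (String dst : e.getValue()) {
                total += sims.getOrDefault(dst, 0.0);
            }
            if (total == 0.0) {
                continue; // no topically similar outlinks: contribute nothing (an assumption)
            }
            for (String dst : e.getValue()) {
                double share = sims.getOrDefault(dst, 0.0) / total;
                next.merge(dst, DAMPING * rank * share, Double::sum);
            }
        }
        return next;
    }
}

In the thesis itself this update would presumably be expressed as a Map/Reduce job, with map tasks emitting (target page, rank contribution) pairs and reduce tasks summing the contributions for each page, in line with the Map/Reduce implementations mentioned above.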
Keywords/Search Tags: Distributed Crawler System, Nutch, Web Page Deduplication, Web Page Ranking, Page Fingerprint