
Research And Optimization Of Distributed Crawlers Based On

Posted on: 2016-01-23
Degree: Master
Type: Thesis
Country: China
Candidate: W Hu
Full Text: PDF
GTID: 2208330461984902
Subject: Computer application technology
Abstract/Summary:
With the advent of the big data era, data on the Internet is expanding rapidly, and the speed of data collection can no longer meet practical needs. A crawler system must fetch a huge number of web pages, so crawling them efficiently and stably is critical. The wide distribution and dynamic nature of the pages also make it difficult to keep local copies of the pages fresh; the crawler must update its copies in time to avoid stale pages. In this paper, we improve the Nutch crawler and combine it with the Hadoop distributed platform to design an efficient and reliable distributed crawler system. The main findings are as follows:

1. Combining Nutch with the Hadoop distributed platform. When running stand-alone, Nutch is prone to single points of failure and instability because the storage and computing capacity of a single machine is limited. Relying on the advantages of the Hadoop distributed platform, each step of a Nutch run is submitted to Hadoop and completed by MapReduce through distributed computation, with the data stored on HDFS. We compared stand-alone Nutch against distributed Nutch experimentally; the results show that as the number of nodes in the distributed cluster increases, Nutch's crawling performance grows linearly. The system also gains improved data security, reliability, and load balancing across nodes.

2. A dynamic proxy-IP replacement module. A detailed analysis of Nutch's web-crawling workflow shows that when a site uses an IP-based access-detection mechanism, massive Nutch access can easily be blocked. To address this problem, we propose a dynamic proxy-IP replacement module integrated with Nutch: when crawling is blocked, the proxy IP is replaced so that crawling can continue. Tests show that this effectively solves the problem of Nutch crawling being blocked.

3.
Page-change prediction optimization. Nutch has a web-update module, but the page-change cycle must be set manually and applies uniformly to all pages, which makes it hard to adapt to the diversity of massive web content. This paper presents a dynamic selection strategy for predicting the page-change cycle. When the historical update data for the web pages is insufficient, the strategy uses a MapReduce-based DBSCAN clustering algorithm to reduce the number of pages the crawler system must fetch: the update cycles of the sampled web pages are applied to the other pages in the same cluster. When the data is sufficient, it is used to fit a Poisson-process model, which predicts each page's update cycle more accurately. Experiments show that the dynamic selection strategy saves crawl resources and predicts the page-change cycle more accurately.
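The Poisson-process step above can be sketched numerically. Under a Poisson model, a page's changes occur at some rate λ; the maximum-likelihood estimate of λ is simply the observed number of changes divided by the observation window, and the probability that the page has changed within t days is 1 − e^(−λt). The function names and the 0.5 freshness threshold below are illustrative choices, not the thesis's actual parameters.

```python
import math

def estimate_change_rate(num_changes, observation_days):
    """Maximum-likelihood estimate of the Poisson change rate (changes/day)."""
    return num_changes / observation_days

def prob_changed_within(lam, days):
    """P(at least one change within `days` days) under a Poisson process."""
    return 1.0 - math.exp(-lam * days)

def suggested_recrawl_interval(lam, freshness=0.5):
    """Interval after which the page has changed with probability
    `freshness`: solves 1 - exp(-lam * t) = freshness for t."""
    return -math.log(1.0 - freshness) / lam

# Example: a page observed to change 10 times over 30 days.
lam = estimate_change_rate(10, 30)        # about 0.333 changes per day
p = prob_changed_within(lam, 2)           # chance it changed in the last 2 days
t = suggested_recrawl_interval(lam, 0.5)  # re-crawl roughly every 2.08 days
```

The crawler can then schedule each page's next fetch at its own interval t, instead of one manually set cycle for all pages.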
Keywords/Search Tags: Nutch, Crawler, Hadoop, Proxy IP, Web Change Prediction, DBSCAN algorithm, MapReduce, Poisson Model