
Research And Optimization Of Distributed Crawlers Based On

Posted on: 2016-01-23
Degree: Master
Type: Thesis
Country: China
Candidate: W Hu
Full Text: PDF
GTID: 2208330461984902
Subject: Computer application technology
Abstract/Summary:
With the advent of the big data era, data on the Internet is expanding rapidly, and the speed of data collection can no longer meet practical needs. A crawler system must fetch a huge number of web pages, so crawling them efficiently and stably is critical. The wide distribution and dynamic nature of the pages also make it difficult to keep local copies of the pages fresh; the crawler must update its copies in time to avoid stale pages. In this paper, we improve the Nutch crawler and combine it with the Hadoop distributed platform to design an efficient and reliable distributed crawler system. The main findings are as follows:

1. Combining Nutch with the Hadoop distributed platform. When running stand-alone, Nutch is prone to single points of failure and instability because the storage and computing capacity of a single machine is limited. Relying on the advantages of the Hadoop distributed platform, each step of a Nutch run is submitted to Hadoop and completed by MapReduce through distributed computation, with the data stored on HDFS. We compared stand-alone Nutch against distributed Nutch experimentally; the results show that as the number of nodes in the distributed cluster increases, Nutch's crawling performance grows linearly. The system also gains improved data security, reliability, and load balancing across nodes.

2. A dynamic proxy-IP replacement module. A detailed analysis of Nutch's web-crawling workflow shows that when a site uses an IP-based access-detection mechanism, massive Nutch access can easily be blocked. To address this problem, we propose a dynamic proxy-IP replacement module integrated with Nutch: when crawling is blocked, the proxy IP is replaced so that crawling can continue. Tests show that this effectively solves the problem of Nutch crawling being blocked.

3.
Page-change prediction optimization. Nutch has a web-update module, but the page-change cycle must be set manually and applies uniformly to all pages, which makes it hard to adapt to the diversity of massive web content. This paper presents a dynamic selection strategy for predicting the page-change cycle. When the historical update data for the web pages is insufficient, the strategy uses a MapReduce-based DBSCAN clustering algorithm to reduce the number of pages the crawler system must fetch: the update cycles of the sampled web pages are applied to the other pages in the same cluster. When the data is sufficient, it is used to fit a Poisson-process model, which predicts each page's update cycle more accurately. Experiments show that the dynamic selection strategy saves crawl resources and predicts the page-change cycle more accurately.
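The Poisson-process step above can be sketched numerically. Under a Poisson model, a page's changes occur at some rate λ; the maximum-likelihood estimate of λ is simply the observed number of changes divided by the observation window, and the probability that the page has changed within t days is 1 − e^(−λt). The function names and the 0.5 freshness threshold below are illustrative choices, not the thesis's actual parameters.

```python
import math

def estimate_change_rate(num_changes, observation_days):
    """Maximum-likelihood estimate of the Poisson change rate (changes/day)."""
    return num_changes / observation_days

def prob_changed_within(lam, days):
    """P(at least one change within `days` days) under a Poisson process."""
    return 1.0 - math.exp(-lam * days)

def suggested_recrawl_interval(lam, freshness=0.5):
    """Interval after which the page has changed with probability
    `freshness`: solves 1 - exp(-lam * t) = freshness for t."""
    return -math.log(1.0 - freshness) / lam

# Example: a page observed to change 10 times over 30 days.
lam = estimate_change_rate(10, 30)        # about 0.333 changes per day
p = prob_changed_within(lam, 2)           # chance it changed in the last 2 days
t = suggested_recrawl_interval(lam, 0.5)  # re-crawl roughly every 2.08 days
```

The crawler can then schedule each page's next fetch at its own interval t, instead of one manually set cycle for all pages.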
Keywords/Search Tags: Nutch, Crawler, Hadoop, Proxy IP, Web Change Prediction, DBSCAN algorithm, MapReduce, Poisson Model