
Research And Implementation Of Distributed Web Crawler

Posted on: 2018-08-27
Degree: Master
Type: Thesis
Country: China
Candidate: M Wang
Full Text: PDF
GTID: 2348330515458276
Subject: Software engineering
Abstract/Summary:
With the rapid development of Internet technology, people's demand for Internet information in both work and daily life keeps growing, and the importance of search-engine technology is increasingly evident. Search-engine technology has become deeply embedded in people's lives, and the web crawler is a critical component of every search engine. At present, a single-machine crawler cannot keep up with the scale of today's Internet, which has driven the emergence of distributed web-crawler technology. A distributed system divides the work cooperatively among multiple machines, increasing computing throughput on large data volumes and improving crawling performance; distributed storage likewise greatly improves the data-storage performance of the whole system.

Starting from the shortcomings of the stand-alone web crawler, this thesis introduces the distributed web crawler in detail and designs and implements a distributed web crawler based on the Hadoop platform. The main work of this thesis is as follows:

(1) Introduced the working principles and key technologies of search engines and distributed web crawlers, designed the architecture of the distributed crawler system, analyzed the concrete implementation flow and principles of the key modules, and described how each module is implemented with MapReduce.

(2) Identified algorithms and strategies of existing crawlers that cannot meet current demands, and optimized two of them: the URL weighting algorithm and the URL deduplication strategy, which greatly improves the crawling speed of the web crawler.

(3) Set up a test environment for the distributed system, designed a test plan covering three aspects: functional testing, performance testing, and scalability testing; analyzed the test data; and compared the URL weighting algorithm and the URL deduplication strategy before and after optimization.

In short, the significance of this thesis lies in the design and implementation of a distributed web-crawler system that, to a certain extent, solves the problems of low crawling efficiency and poor scalability, and improves the speed and quality of web crawling.
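The abstract does not specify how the URL deduplication strategy works. A common choice in distributed crawlers is a Bloom filter, which tests set membership in constant space with no false negatives (a seen URL is never re-crawled) at the cost of rare false positives. The sketch below is a hypothetical illustration of that idea, not the thesis's actual implementation; the class name `BloomFilter` and the parameters are assumptions for the example.

```python
import hashlib

class BloomFilter:
    """Probabilistic set for URL deduplication: no false negatives,
    a small chance of false positives (an unseen URL reported as seen)."""

    def __init__(self, size_bits=1 << 20, num_hashes=3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = 0  # a Python int used as an arbitrary-length bit array

    def _positions(self, url):
        # Derive k bit positions by salting the URL with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, url):
        for p in self._positions(url):
            self.bits |= 1 << p

    def __contains__(self, url):
        return all((self.bits >> p) & 1 for p in self._positions(url))

# Typical crawler loop: only enqueue URLs not seen before.
seen = BloomFilter()
frontier = []
for url in ["http://example.com/a", "http://example.com/b",
            "http://example.com/a"]:          # duplicate is skipped
    if url not in seen:
        seen.add(url)
        frontier.append(url)
```

In a Hadoop-based crawler the filter (or a sharded variant of it) would be consulted before a URL enters the crawl frontier, keeping the deduplication check O(1) per URL regardless of how many pages have been fetched.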
Keywords/Search Tags:Search engine, Distributed, Web crawler, Hadoop, MapReduce