Research And Implementation Of Distributed Crawler Technology

Posted on: 2020-10-08
Degree: Master
Type: Thesis
Country: China
Candidate: L Ma
GTID: 2428330623962982
Subject: Computer technology
Abstract/Summary:
Data on the network contains a large amount of valuable information. To automatically collect, analyze, format and store large volumes of data from web pages, a distributed web crawler technology is proposed, and the implementation methods and technical details of crawling and collecting web big data are discussed. The Nutch crawler framework is built on a Hadoop distributed cluster, Zookeeper is used to coordinate and schedule the cluster, and Redis, a high-performance key-value database, is adopted to store the data. The Solr engine is used to index and clearly display the captured information. A page information extraction algorithm optimizes the process of extracting page information, and a keyword matching optimization algorithm obtains indicator-related data; together they complete data collection and page analysis, achieving distributed, accurate and modular crawling of web data. The construction of the Hadoop cluster, the implementation of the Nutch project and the collection of a large amount of data verify the technical feasibility of the Nutch-based distributed web crawler architecture and its operation process. Comparison of experimental data shows that the page information extraction algorithm and the keyword matching optimization algorithm substantially optimize the crawling process, making it more rigorous and accurate. Comparison and analysis of multiple sets of experimental data between the Nutch-based distributed crawler and other similar crawlers confirm that the distributed crawler outperforms traditional crawlers in performance and accuracy, is better suited to crawling massive data, and performs well in terms of speed and capacity.
Keywords/Search Tags:Distributed crawler, Nutch, Solr, Get page information
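
The abstract names a page information extraction step and a keyword matching step but does not give their details. The following is only a minimal sketch of the general idea, not the thesis's algorithms: it assumes that extraction means stripping markup to visible text and that matching means counting whole-word keyword occurrences. The names `TextExtractor`, `extract_text` and `keyword_hits` are hypothetical, and only the Python standard library is used rather than Nutch itself.

```python
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML page as one string (assumed step)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def keyword_hits(text, keywords):
    """Count case-insensitive whole-word occurrences of each keyword."""
    hits = {}
    for kw in keywords:
        pattern = r"\b" + re.escape(kw) + r"\b"
        hits[kw] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return hits


if __name__ == "__main__":
    page = (
        "<html><head><style>p{color:red}</style>"
        "<script>var gdp = 1;</script></head>"
        "<body><h1>GDP report</h1><p>The GDP grew; population fell.</p></body></html>"
    )
    text = extract_text(page)
    print(text)
    print(keyword_hits(text, ["gdp", "population", "income"]))
```

In a crawler of the kind described, a filter like this would run on each fetched page before storage or indexing, so that only pages matching the indicators of interest are kept.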