
Design And Implementation Of A Distributed Intelligent Web Crawler System

Posted on: 2017-10-19
Degree: Master
Type: Thesis
Country: China
Candidate: G Z He
Full Text: PDF
GTID: 2348330512952112
Subject: Computer technology
Abstract/Summary:
With the rapid development of the Internet, the Web has become the main platform on which people publish and retrieve information. Acquiring information quickly and accurately from the massive volume of web pages is an ever-growing demand, and web crawling is the research area that addresses it. Because application fields differ, and users with different backgrounds have different retrieval needs and purposes, many types of web crawler have been developed. There are now many studies and open source implementations of web crawlers, but most are limited to the overall architecture or to a single processing stage. They lack a complete treatment of crawl strategy, page data extraction, data storage, and system monitoring, and they offer little automatic processing, so it is difficult to build a complete, usable, large-scale web crawler system from them. Research on and improvement of web crawlers is therefore significant work.

Building on existing research, this thesis designs and implements a distributed web crawler system whose goal is to provide high-quality data support for a network public opinion system. The thesis focuses on the following research aspects:

First, intelligent seed management. The system dynamically adjusts the scheduling frequency according to seed collection history, and it automatically generates a web page extraction model by analyzing sample detail pages, which enables automatic extraction of web page content.

Second, collection of Ajax dynamic web pages. An abstract dynamic web page model is derived from single-page and multi-page web pages that update dynamically through interaction, and the browser component PhantomJS is used to render the dynamic pages. The thesis also designs and implements NASScript, a JavaScript-based scripting language for automatic navigation and browsing, so that the crawler can automatically perform interactive browsing operations on dynamic web pages and collect data from them.

Third, intelligent management of the crawler system. The crawler service nodes are monitored in real time by deploying third-party program modules, and the whole crawler system is managed intelligently according to the scheduled maintenance processing rules.

The crawler system presented in this thesis addresses the low efficiency, poor scalability, and low degree of automation of a single crawler, improves the speed and accuracy of data extraction, and expands the scale of information collection. The thesis closes with screenshot samples and test results of the web crawler system. The system can collect data from dynamic web pages, its automatic extraction achieves high precision, and it realizes intelligent management of the whole crawler system.
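
As a rough illustration of the first aspect (adjusting scheduling frequency from seed collection history), a minimal Python sketch might look like the following. The abstract does not state the rule used in the thesis, so the `Seed` class, the `next_interval` function, the halve/double rule, and the interval bounds are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Seed:
    url: str
    interval_s: float = 3600.0  # current revisit interval in seconds (assumed default)


def next_interval(seed, new_items, min_s=300.0, max_s=86400.0):
    """Return an adjusted revisit interval for a seed.

    Hypothetical rule, not taken from the thesis: halve the interval when
    the last crawl found new items, double it when it found none, and
    clamp the result to the range [min_s, max_s].
    """
    factor = 0.5 if new_items > 0 else 2.0
    return max(min_s, min(max_s, seed.interval_s * factor))


# Example history: the first crawl finds 3 new items, the next two find none.
seed = Seed("http://example.com/news")  # placeholder URL
for found in [3, 0, 0]:
    seed.interval_s = next_interval(seed, found)
```

With this rule, a seed that keeps producing new pages is revisited more often, while a seed that has gone quiet is backed off toward the upper bound. A real scheduler would likely use a longer history window than the single most recent crawl.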
Keywords/Search Tags: distributed framework, machine learning, intelligent extraction, automatic navigation browsing