The Design And Development Of Deep-Customizable Crawler Tool System

Posted on: 2019-01-24
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Li
Full Text: PDF
GTID: 2348330545984479
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of network technology, the content on the Internet is growing explosively and becoming increasingly abundant. This information overload makes it harder to collect and process data manually, so collecting and extracting useful information effectively from the massive data on the Internet is an urgent problem. Crawler technology helps considerably with data collection. However, writing crawler programs is a cumbersome and complex task, especially when crawling a large number of similar or different websites or apps. If a separate crawler program is written for each specific website or app, the result is repeated, complex work and a tremendous revision and maintenance effort. In addition, new programmers with little experience in writing crawlers may not be able to write them successfully on their own, and they may find it hard to take over crawler projects written by other programmers. Therefore, a general crawler tool was designed and developed in this thesis. Configuration files can be customized to accommodate the settings of different target websites, which avoids case-by-case programming and facilitates secondary development. The proposed crawler is not designed for a specific website; it is a general tool that allows users to build their own crawlers by writing a configuration file. To make secondary development easy, the architecture is highly abstract and the complex modules are hidden from users. Based on an investigation and analysis of crawler technology, a crawling system built on the open-source framework Scrapy was designed and developed. In addition to improvements and extensions to Scrapy, supporting modules were developed to ensure robustness and efficiency. Finally, the stability and efficiency of the crawler system were verified by system testing, and potential optimizations were suggested.
Keywords/Search Tags: general crawler, parsing template, duplicate URL removal, anti-crawler technology, monitoring and alarm
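
To make the configuration-driven idea concrete, the following is a minimal sketch of how a parsing template and duplicate-URL removal might look. It is not the thesis's implementation and does not use Scrapy. The configuration layout, the field names, and the helpers `url_fingerprint`, `SeenUrls`, and `parse_with_template` are all assumptions introduced for illustration, and regular expressions stand in for whatever selector mechanism the real system uses.

```python
import hashlib
import json
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical configuration: in a config-driven crawler, a file like this
# would be the only thing a user writes for a new target site.
CONFIG_JSON = """
{
  "name": "example_news",
  "start_urls": ["http://example.com/news?page=1"],
  "fields": {
    "title": "<h1[^>]*>(.*?)</h1>",
    "author": "<span class='author'>(.*?)</span>"
  }
}
"""


def url_fingerprint(url):
    """Canonicalise a URL and hash it so that equivalent URLs compare equal."""
    parts = urlsplit(url)
    # Sort query parameters and drop the fragment, which never reaches the server.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    canonical = urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", query, "")
    )
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


class SeenUrls:
    """In-memory set of URL fingerprints (assumed; a real system may persist it)."""

    def __init__(self):
        self._seen = set()

    def add_if_new(self, url):
        """Return True if the URL was not seen before, and record it."""
        fingerprint = url_fingerprint(url)
        if fingerprint in self._seen:
            return False
        self._seen.add(fingerprint)
        return True


def parse_with_template(html, fields):
    """Apply each configured pattern to the page and collect the first match."""
    item = {}
    for name, pattern in fields.items():
        match = re.search(pattern, html, flags=re.DOTALL | re.IGNORECASE)
        item[name] = match.group(1).strip() if match else None
    return item


config = json.loads(CONFIG_JSON)
seen = SeenUrls()
sample_html = "<html><h1> Hello </h1><span class='author'>Li</span></html>"
item = parse_with_template(sample_html, config["fields"])
```

In this sketch, supporting a new site means editing only the configuration, while the fingerprinting and parsing code stays unchanged. That is the kind of separation the abstract describes, although the actual system may structure it differently.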