Design And Implementation Of Anti-Crawler System Based On Spark Streaming

Posted on:2022-10-18

Degree:Master

Type:Thesis

Country:China

Candidate:J Y Guo

Full Text:PDF

GTID:2518306605490244

Subject:Master of Engineering

Abstract/Summary:

As Internet infrastructure expands rapidly, Internet applications and big data technology play an increasingly significant role, and the value of data of all kinds has become ever more prominent. Web crawlers are now a common means of collecting Internet data. As crawler technology spreads, malicious or badly designed crawlers consume large amounts of server resources, leak private data, and cause other problems. Anti-crawler technology has therefore become a research focus at many companies, where it protects data, ensures system stability, and helps maintain a competitive advantage. An anti-crawler system is needed to give users a better product experience and to reduce access by malicious crawlers. In practice, the techniques crawlers use to disguise themselves are becoming more complex, which makes crawler detection harder and harder. This thesis examines anti-crawler rules and develops an anti-crawler system using Spark Streaming. The specific content is as follows:

(1) The anti-crawler system is built on Spark Streaming and a Lua + Nginx + Kafka framework. The data acquisition module collects, processes, and computes statistics on system access traffic in real time. The data are cleansed and desensitized, with MySQL used for data storage and Redis for storing and maintaining data. Real-time data processing provides the data support needed to identify crawlers.

(2) After analyzing and reviewing current anti-crawler methods, this thesis designs anti-crawler rules based on the access patterns of crawler IP (Internet Protocol) addresses. One rule checks whether the User-Agent of a request lacks any browser keyword. The system also applies threshold detection to each visiting IP address: the total number of visits to key pages from that IP address within a given minute is compared with a threshold, and the number of cookies sent from that IP address to key pages within a given minute is compared with a threshold. By applying these rules, the real-time monitoring task adds malicious crawlers to a blacklist. The anti-crawler system improves the accuracy and speed of crawler detection and further ensures the security of the system.

(3) This thesis designs real-time monitoring for the system and uses Spark Streaming real-time processing to compute and analyze the anti-crawler results. The system then displays the collected traffic, the periods in which crawlers are active, and their crawling frequency as charts, which helps users discover patterns in crawler behavior.

This thesis first explains the research background of the anti-crawler system, including the current state of domestic and international research. It then introduces the key technologies on which the anti-crawler system relies. Next, the system framework and the system database are designed on the basis of a requirements analysis. The system is divided into a data acquisition module, a data processing and real-time calculation module, and a data visualization module, and the thesis presents the detailed design and implementation of each. Finally, the anti-crawler system is implemented and tested.
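
The three rules described in item (2) can be illustrated with a minimal sketch in plain Python. It is not the thesis's Spark Streaming implementation: it evaluates a single batch of access records in memory, much as one micro-batch might be evaluated. The record fields, the browser keywords, the key-page prefixes, and both threshold values are assumptions made for illustration, as is the choice to count distinct cookie values per IP address and minute.

```python
from collections import defaultdict
from dataclasses import dataclass

# Every constant below is a hypothetical value chosen for illustration only.
BROWSER_KEYWORDS = ("mozilla", "chrome", "safari", "firefox", "edge", "opera")
KEY_PAGE_PREFIXES = ("/search", "/detail")
MAX_KEY_PAGE_VISITS_PER_MINUTE = 30
MAX_COOKIES_PER_MINUTE = 5


@dataclass(frozen=True)
class AccessRecord:
    """One cleaned access-log entry (field names are assumed)."""
    ip: str
    minute: int       # request time truncated to the minute
    path: str
    user_agent: str
    cookie: str


def is_key_page(path):
    # A page counts as "key" if its path starts with a configured prefix.
    return path.startswith(KEY_PAGE_PREFIXES)


def lacks_browser_keyword(user_agent):
    # Rule 1: the User-Agent contains no known browser keyword.
    ua = user_agent.lower()
    return not any(keyword in ua for keyword in BROWSER_KEYWORDS)


def detect_crawlers(records):
    """Return the set of IPs that break at least one rule in this batch."""
    visits = defaultdict(int)    # (ip, minute) -> number of key-page visits
    cookies = defaultdict(set)   # (ip, minute) -> distinct cookies on key pages
    blacklist = set()

    for record in records:
        if lacks_browser_keyword(record.user_agent):
            blacklist.add(record.ip)
        if is_key_page(record.path):
            key = (record.ip, record.minute)
            visits[key] += 1
            cookies[key].add(record.cookie)

    # Rule 2: too many key-page visits from one IP within one minute.
    for (ip, _minute), count in visits.items():
        if count > MAX_KEY_PAGE_VISITS_PER_MINUTE:
            blacklist.add(ip)

    # Rule 3: too many distinct cookies from one IP within one minute.
    for (ip, _minute), seen in cookies.items():
        if len(seen) > MAX_COOKIES_PER_MINUTE:
            blacklist.add(ip)

    return blacklist
```

In a streaming deployment the per-minute counters would more likely live in a windowed aggregation or an external store such as Redis rather than in local dictionaries; the sketch only shows how the three conditions combine into a blacklist decision.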

Keywords/Search Tags:

Crawler, Anti-Crawler, Big Data, Internet Applications

Related items

1	Research On Topic Focused Web Crawler And Related Technologies
2	Ecological Scientific Investigation Data System With Anti-crawler Mechanism
3	The Design And Development Of Deep-Customizable Crawler Tool System
4	Internet Crawler Research And Implementation
5	Research And Application Of Distributed Crawler Technology Based On Ant Colony Algorithm
6	Based On Anomaly Detection Technology Anti Crawler System Design And Application
7	Research And Implementation Of Web Crawler For URL-Specified Crawling Of Ajax-Based Web Applications
8	Design And Implementation Of Web Crawler System Based On Scrapy Framework
9	Research And Implementation Of Content Detection System Based On Net Crawler
10	The Design And Implementation Of Anti-Crawler System At Dianping