
Design And Implementation Of Anti-Crawler System Based On Spark Streaming

Posted on: 2022-10-18  Degree: Master  Type: Thesis
Country: China  Candidate: J Y Guo  Full Text: PDF
GTID: 2518306605490244  Subject: Master of Engineering
Abstract/Summary:
With the accelerated construction of the Internet in modern society, Internet applications and big data technology have come to play a significant technical role, and the value of all kinds of data has become increasingly prominent. Web crawlers are an increasingly common means of collecting Internet data. As crawler technology spreads, malicious or badly designed crawlers consume large amounts of server resources, leak private data, and cause other problems. As a result, anti-crawler technology has gradually become a research focus at many companies, where it serves to protect data, ensure system stability, and maintain competitive advantage. An anti-crawler system is therefore needed to give users a better product experience and to reduce access by malicious crawlers. In practice, the techniques crawlers use to conceal themselves are becoming more complex, which makes anti-crawler work increasingly difficult. This thesis examines anti-crawler rules and develops an anti-crawler system using Spark Streaming. The specific content is as follows:

(1) The anti-crawler system is built on Spark Streaming and a Lua + Nginx + Kafka framework. The data acquisition module collects the system's access traffic, which is then processed and computed in real time. The data is cleansed and desensitized, with a MySQL database used for data storage and Redis used for storage and maintenance. Real-time data processing provides the data support needed to identify crawlers.

(2) After analyzing and reviewing current anti-crawler methods, this thesis designs anti-crawler rules based on the access patterns of crawler IP (Internet Protocol) addresses. One rule checks whether the User-Agent of a request lacks browser keywords. The system also applies threshold detection to each visiting IP address: the total number of visits to key pages from that IP address within a given minute is compared with a threshold, and the number of cookies from that IP address on key pages within a given minute is compared with a threshold. By applying the corresponding rules, the real-time monitoring task adds malicious crawlers to a blacklist. The system improves the accuracy and speed of crawler detection and further ensures the security of the protected system.

(3) This thesis designs real-time monitoring for the system and uses Spark Streaming real-time processing to compute and analyze the anti-crawler results. The system then displays data collection traffic, crawler active periods, and crawling frequency as charts, which helps users discover patterns in crawler behavior.

This thesis first explains the research background of the anti-crawler system, including the current state of research in China and abroad. It then introduces the key technologies the system requires. Next, the system framework and system database are designed on the basis of requirements analysis, and the system is divided into a data acquisition module, a data processing and real-time calculation module, and a data visualization module. The thesis also presents the detailed design and implementation of the system. Finally, the system is tested.
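The rules in item (2) can be illustrated with a minimal sketch. It is written in plain Python over an in-memory batch rather than Spark Streaming, and it is not the thesis's implementation. The record fields, the browser keyword list, the key page paths, both threshold values, and the choice to flag an IP address when a count exceeds its threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

# All constants below are hypothetical; the abstract does not give actual values.
BROWSER_KEYWORDS = ("mozilla", "chrome", "safari", "firefox", "edge", "opera")
KEY_PAGES = {"/search", "/detail"}
VISIT_THRESHOLD = 3    # max key-page visits per IP per minute
COOKIE_THRESHOLD = 2   # max distinct cookies per IP per minute on key pages


@dataclass(frozen=True)
class Access:
    """One access-log record (field names are assumed)."""
    ip: str
    minute: int        # minute bucket the request falls into
    path: str
    user_agent: str
    cookie: str


def lacks_browser_keyword(user_agent: str) -> bool:
    """Rule 1: the User-Agent contains none of the browser keywords."""
    ua = user_agent.lower()
    return not any(keyword in ua for keyword in BROWSER_KEYWORDS)


def find_crawlers(records):
    """Return the set of IP addresses that break at least one rule."""
    visits = defaultdict(int)    # (ip, minute) -> key-page visit count
    cookies = defaultdict(set)   # (ip, minute) -> distinct cookies seen
    blacklist = set()

    for record in records:
        if lacks_browser_keyword(record.user_agent):
            blacklist.add(record.ip)
        if record.path in KEY_PAGES:
            key = (record.ip, record.minute)
            visits[key] += 1
            cookies[key].add(record.cookie)

    # Rule 2: too many key-page visits within a single minute.
    for (ip, _minute), count in visits.items():
        if count > VISIT_THRESHOLD:
            blacklist.add(ip)

    # Rule 3: too many distinct cookies within a single minute (assumed direction).
    for (ip, _minute), seen in cookies.items():
        if len(seen) > COOKIE_THRESHOLD:
            blacklist.add(ip)

    return blacklist
```

In a streaming deployment the same per-IP, per-minute counts would be computed over windows of incoming records, and the resulting blacklist would be written to a shared store so that the front end can reject flagged addresses.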
Keywords/Search Tags:Crawler, Anti-Crawler, Big Data, Internet Applications