
Research And Implementation Of Malicious Crawler Detection System Based On Flink Platform

Posted on: 2022-12-03
Degree: Master
Type: Thesis
Country: China
Candidate: Z S Ren
Full Text: PDF
GTID: 2518306773497564
Subject: Trade Economy
Abstract/Summary:
In recent years, illegal and criminal cases caused by malicious crawlers have emerged one after another. According to relevant statistics, malicious crawler attacks have been increasing exponentially year after year, and the security threat they pose has become increasingly prominent. It is therefore necessary to design and implement a malicious crawler detection system that meets current needs. Common detection techniques include misuse detection and anomaly detection. Misuse detection is highly accurate but inflexible, and cannot adapt to changes in attack methods. Anomaly detection algorithms are flexible but usually have a high false positive rate. Detecting malicious crawlers requires both high accuracy and good flexibility and adaptability. For this reason, this thesis combines misuse detection with anomaly detection. Based on the characteristics of malicious crawlers, browser fingerprinting is used for misuse detection and the isolation forest algorithm is used for anomaly detection, and malicious crawlers are identified efficiently by maintaining and updating a fingerprint database. The main work of this thesis is as follows:

(1) Browser fingerprint features are extracted based on Shannon entropy, and a total of 14 features are selected as the browser fingerprint. Since the features of a browser fingerprint may change frequently, a Bayesian network is chosen as the fingerprint identification algorithm. Because structure learning is complex and the training data are incomplete, the Bayesian network structure is constructed from expert knowledge, and the parameters are learned by maximum likelihood estimation.

(2) Traditional crawler detection systems cannot perform online real-time detection or adjust the model as the data stream changes. To address this, an improved real-time detection algorithm based on a data-stream isolation forest is proposed. According to the characteristics of malicious crawler access, the isolation forest model is converted into a streaming model that can dynamically update the isolation trees in the forest. A buffer is used: after the model is initially built, incoming data is stored in the buffer, and when the buffer reaches a certain threshold, the forest is updated by fission (splitting) of the isolation trees. After the update is completed, trees whose detection results deviate substantially are discarded, and new trees are constructed to replace them according to the corresponding rules, so that the number of trees in the forest stays constant.

(3) Beyond studying the fingerprinting technique and the isolation forest algorithm, this thesis aims to apply the algorithms in real scenarios and improve them in the process of solving real problems. On the basis of these algorithms, a complete enterprise-level malicious crawler detection system is therefore designed and implemented on Flink. The final experiments show that the system performs very well in detecting malicious crawlers.
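As an illustration of the entropy-based feature selection described in item (1), the following is a minimal sketch, not the thesis's implementation. It assumes each fingerprint is a flat record of attribute values and ranks attributes by the Shannon entropy of their observed values, on the idea that higher-entropy attributes distinguish clients better. The attribute names (`user_agent`, `screen`, `cookies_enabled`), the sample records, and the function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Return the Shannon entropy, in bits, of a list of observed values."""
    counts = Counter(values)
    total = len(values)
    # H = -sum(p * log2(p)) over the empirical distribution of the values.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def rank_features(records):
    """Rank fingerprint attributes by entropy, highest first."""
    names = records[0].keys()
    scores = {
        name: shannon_entropy([record[name] for record in records])
        for name in names
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample of collected fingerprints (assumption, not thesis data).
records = [
    {"user_agent": "UA-1", "screen": "1920x1080", "cookies_enabled": True},
    {"user_agent": "UA-2", "screen": "1920x1080", "cookies_enabled": True},
    {"user_agent": "UA-3", "screen": "1366x768", "cookies_enabled": True},
    {"user_agent": "UA-4", "screen": "1366x768", "cookies_enabled": True},
]

ranking = rank_features(records)
for name, entropy in ranking:
    print(f"{name}: {entropy:.2f} bits")
```

In this sample, `user_agent` takes four distinct values and ranks first, while `cookies_enabled` never varies and carries no identifying information. A selection step could keep the top-ranked attributes; how the thesis arrives at its 14 features is not specified in the abstract.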
Keywords/Search Tags: Malicious crawler, Flink, Browser fingerprint, Isolation forest algorithm