Phishing Detection Technology Based On URL And Web Page Features

Posted on: 2020-04-12
Degree: Master
Type: Thesis
Country: China
Candidate: X Chen
Full Text: PDF
GTID: 2428330596995027
Subject: Control Science and Engineering
Abstract/Summary:
In the 21st century, the Internet deeply affects our lives. While it makes life more convenient, it also brings serious security risks. As the information industry has developed, phishing websites have threatened people's cyberspace security by imitating real websites, such as those of banks, in order to defraud Internet users and gain illegal access to their property. As people rely more and more on the Internet, the number of phishing attacks is growing rapidly, so effective detection methods are needed to ensure network security.

This thesis presents detection methods based on URL features alone and on URL features combined with web page features. URL features fall roughly into three categories: basic features, letter frequency features, and edit distance. The edit distance feature extraction strategy is the novel contribution of this thesis. Experiments show that the edit distance feature effectively increases model accuracy: the final accuracy is 0.946 and 0.959 on data sets of 4,000 and 40,000 websites, respectively. In particular, using the Aho-Corasick algorithm to extract URL features automatically allows the number of pattern strings to be matched to grow without significantly increasing the matching time. Using only URL features on the 40,000-website data set takes an average of 14.1 ms/item. This thesis also proposes three types of web page features: the number of internal and external links, forward links and intra-site links, and whether a login window is present. After optimizing the parameters of the GBDT classifier and combining the URL and web page features, the model reaches an accuracy of 0.976, which can effectively resist phishing attacks.

Web page feature extraction is slow: counting internal and external links and checking for a login window take about 2 seconds, while back links and own links take about 20 seconds. To address this, the thesis stores and queries a large number of feature values using MongoDB+ES. Synchronizing a website takes 0.317 ms/item on average and querying a website takes 17.914 ms/item on average, which greatly improves detection efficiency. The system supports detection of both a single website and a batch of websites. Because a web page cannot always be accessed, two classifiers are trained, GBDT classifier A based only on URL features and GBDT classifier B based on all features, so that the user always receives a result.
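
The abstract does not spell out how the edit distance feature is computed. As a minimal sketch, assuming the feature is the smallest Levenshtein distance between any label of the URL's host name and a list of commonly imitated brand names, it could look like the following. The brand list and the function names `levenshtein` and `min_brand_distance` are hypothetical and are not taken from the thesis.

```python
from urllib.parse import urlparse

# Hypothetical brand list; the thesis's actual target list is not given in the abstract.
BRANDS = ["paypal", "apple", "google", "amazon"]


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def min_brand_distance(url: str, brands=BRANDS) -> int:
    """Smallest edit distance between any host-name label and any brand.

    Returns -1 when the URL has no host name (an assumed convention).
    """
    host = urlparse(url).hostname or ""
    labels = [part for part in host.lower().split(".") if part]
    return min(
        (levenshtein(label, brand) for label in labels for brand in brands),
        default=-1,
    )
```

Under this assumption, a small nonzero value (for example `paypa1` against `paypal`) suggests a look-alike host name, while 0 means a label matches a brand exactly. The value would be one input to the classifier alongside the other URL features, not a verdict on its own.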
Keywords/Search Tags: Phishing Website Detection, Aho-Corasick, Editing Distance, Machine Learning, Feature Storage