Phishing Detection Technology Based On URL And Web Page Features

Posted on:2020-04-12

Degree:Master

Type:Thesis

Country:China

Candidate:X Chen

Full Text:PDF

GTID:2428330596995027

Subject:Control Science and Engineering

Abstract/Summary:

In the 21st century, the Internet deeply affects our lives. While it makes life more convenient, it also brings serious security risks. As the information industry has developed, phishing websites have threatened users' cyberspace security by imitating legitimate websites, such as banks, in order to defraud Internet users and gain illegal access to their property. As people rely more and more on the Internet, the number of phishing attacks is also growing rapidly, so effective detection methods are needed to ensure network security. This thesis presents detection methods based on URL features alone and on a combination of URL and web page features.

URL features can be roughly divided into three categories: basic features, letter frequency features, and edit distance. The edit distance feature extraction strategy is the novel contribution of this thesis. The experiments show that the edit distance feature effectively increases the accuracy of the model: the final accuracy is 0.946 and 0.959 on data sets of 4,000 and 40,000 websites, respectively. When extracting URL features, using the Aho-Corasick algorithm to extract features automatically allows the number of pattern strings to be increased without significantly increasing the matching time. Using only URL features, extraction on the 40,000-website data set takes an average of 14.1 ms/item.

The thesis also proposes three types of web page features: the number of internal and external links, forward links and intra-site links, and whether a login window is present. By optimizing the parameters of the GBDT classifier and combining the URL and web page features, the model reaches an accuracy of 0.976, which can effectively resist phishing attacks. Web page feature extraction is slow: counting internal and external links and checking for a login window takes about 2 seconds, while extracting back links and own links takes about 20 seconds. The thesis therefore stores and queries a large number of feature values using MongoDB+ES. The average time to synchronize a website is 0.317 ms/item, and the average time to query a website is 17.914 ms/item, which greatly improves detection efficiency.

The system supports detection of a single website and batch detection of multiple websites. Because web access is unreliable, two classifiers are trained: GBDT classifier A, based only on URL features, and GBDT classifier B, based on all features. This ensures that the user always receives a result.
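
The abstract does not say how the edit distance feature is computed. As a rough illustration only, the sketch below assumes the feature is the smallest Levenshtein distance between any label of the URL's hostname and a list of commonly imitated brand names. The brand list `TARGET_BRANDS`, the function names, and the `-1` sentinel are hypothetical and do not come from the thesis.

```python
from urllib.parse import urlsplit

# Hypothetical brand list; the thesis does not publish the names it compares against.
TARGET_BRANDS = ["paypal", "apple", "google"]


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def min_brand_distance(url: str, brands=TARGET_BRANDS) -> int:
    """Smallest edit distance between any hostname label and any brand name.

    Returns -1 when the URL has no hostname (an assumed sentinel value).
    """
    host = urlsplit(url).hostname or ""
    labels = [label for label in host.lower().split(".") if label]
    if not labels:
        return -1
    return min(levenshtein(label, brand) for label in labels for brand in brands)


if __name__ == "__main__":
    for candidate in ("http://www.paypa1.com/login", "https://www.google.com/"):
        print(candidate, min_brand_distance(candidate))
```

Under this assumption, a small non-zero distance suggests a hostname that closely imitates a known brand, which is the kind of signal a classifier such as GBDT could use alongside the basic and letter frequency features.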

Keywords/Search Tags:

Phishing Website Detection, Aho-Corasick, Edit Distance, Machine Learning, Feature Storage

Related items

1	Research On Phishing Detection Based On The Link Features Of Website
2	A Phishing Website Detection Method Based On Stacking Model
3	Research On Phishing Detection Based On Feature Label
4	Research On Phishing Website Detection Technology In Dual-structural Network
5	The Phishing Detection Algorithm Research Based On Meta-learning
6	Research And Implementation On Joint Features And Intelligent Detection Algorithms Of Phishing Webpages
7	Research On Phishing Webpages Detection Based On Machine Learning
8	Research On Phishing Website Detection Based On Data Mining Classification Algorithm
9	Phishing Website Detection By Link Analysis
10	Research On Phishing Website Hierarchical Detection Based On Webpage Features