Phishing Detection Based On Semantic Features And Self-Supervised Model

Posted on: 2023-03-24
Degree: Master
Type: Thesis
Country: China
Candidate: X X Quan
GTID: 2558307070483884
Subject: Engineering
Abstract/Summary:
Phishing is a common form of cybercrime with serious impact on finance, politics, e-commerce and other fields. Cybercriminals obtain sensitive information by crafting phishing URLs that look like legitimate ones and tricking victims into clicking them. Compared with detection based on web-page content or visual similarity matching, URL-based detection has lower cost and better detection performance. Most existing phishing detection either extracts URL features and classifies them with machine learning, or trains a deep learning model on labeled samples. The former relies mainly on URL features and performs poorly. The latter requires balanced samples; otherwise it achieves excellent accuracy but poor recall. To address these problems, this thesis proposes two phishing URL detection methods.

(1) We propose a phishing URL detection method based on semantic features. First, we build a database of basic morphemes and split each URL into a set of tokens using delimiters. Then we extract 10 semantic features using word segmentation and a basic vocabulary database. These are combined with 16 character features from existing studies, and machine learning is used to detect phishing URLs. Unlike approaches that use character features alone, the proposed method uses both character-level and word-level features, and the results show that it reaches 96.59% accuracy.

(2) To address the problem that legitimate URLs outnumber phishing URLs, we propose a phishing URL detection method based on a self-supervised model, PDSS (Phishing Detection based on a Seq2seq model). First, an encoder extracts semantic and nonlinear features from legitimate URLs, and an LSTM-based seq2seq model is trained to predict legitimate URLs. We then set a threshold from the reconstruction loss between the reconstructed URL and the original URL, and use it to decide whether a test URL is a phishing URL. Compared with traditional deep learning, the proposed method needs no phishing URLs for training and can be applied where phishing samples are scarce. Experiments show that the proposed method reaches a precision of 99.68% and a recall of 98.11%.
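
The two ideas in the abstract, splitting a URL into tokens with delimiters for word-level features and flagging URLs whose reconstruction loss exceeds a threshold, can be illustrated with a minimal Python sketch. Everything named here is an assumption for illustration only: the delimiter set, the tiny `VOCAB` set, the two example features, and the quantile rule in `pick_threshold` are hypothetical and are not taken from the thesis, which does not list its 10 semantic features or its threshold rule in this abstract. The loss values are assumed to come from some already trained reconstruction model, which is not shown.

```python
import re

# Hypothetical delimiter set and vocabulary (assumptions, not from the thesis).
DELIMITERS = r"[/.?=&:_\-]+"
VOCAB = {"http", "https", "www", "com", "login", "secure", "account", "example"}


def split_url(url):
    """Split a URL into lowercase tokens on the assumed delimiters."""
    return [tok for tok in re.split(DELIMITERS, url.lower()) if tok]


def word_features(url):
    """Two illustrative word-level features (not the thesis's actual ten)."""
    tokens = split_url(url)
    known = sum(1 for tok in tokens if tok in VOCAB)
    return {
        "token_count": len(tokens),
        "known_word_ratio": known / len(tokens) if tokens else 0.0,
    }


def pick_threshold(legit_losses, quantile=0.95):
    """Pick a loss threshold as an order statistic of the losses measured on
    held-out legitimate URLs. The quantile rule is an assumption."""
    if not legit_losses:
        raise ValueError("need at least one loss value")
    ordered = sorted(legit_losses)
    index = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[index]


def is_phishing(loss, threshold):
    """Flag a URL whose reconstruction loss is strictly above the threshold."""
    return loss > threshold
```

In this sketch the model is never shown phishing URLs: the threshold is derived from legitimate URLs alone, which matches the abstract's claim that training needs no phishing samples. How the threshold is actually chosen in PDSS is not stated here.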
Keywords/Search Tags: Phishing URL Detection, Semantic Features, Autoencoder, Self-supervised Model