Fraudulent URL Detection Based On Big Data

Posted on: 2019-09-21
Degree: Master
Type: Thesis
Country: China
Candidate: Y R Huang
GTID: 2428330590992394
Subject: Electronics and Communications Engineering
Abstract/Summary:
Phishing is a deception technique that combines social engineering with sophisticated attack methods to gather sensitive personal information, such as passwords, account details and credit card details, by masquerading as a trustworthy person or business in an electronic communication. To address the limitations of existing anti-phishing solutions, this paper proposes a fraudulent URL detection scheme based on big data. The research content and main work of this paper are as follows:

1. This paper first discusses the definition, working principle, common attack methods and types of phishing websites, then reviews the current mainstream anti-phishing technologies and summarizes the advantages and disadvantages of these detection techniques.

2. This paper presents a detection algorithm based on multiple features of websites. The algorithm extracts features from a website's URL and its page content to form a feature vector, and then uses Random Forest, Logistic Regression and Support Vector Machine classifiers to decide whether the website is a phishing website.

3. This paper presents a character-level recurrent neural network for phishing detection. The input of the algorithm is the preprocessed URL string. The algorithm first uses the Skip-gram model in Word2Vec to convert the characters in the URL into vectors, then uses a Bi-directional Long Short-Term Memory (LSTM) network to encode the URL text, and finally uses an activation function to classify phishing websites.

4. Finally, the above research results are implemented with the Spark MLlib and Keras frameworks to build a real-time detection system for fraudulent URLs. The system reaches an average throughput of 1000 URLs per minute.

Experiments show that the Bi-directional LSTM can effectively use semantic information and performs better than the traditional methods. The proposed algorithm achieves an average precision of 98% on a data set downloaded from the PhishTank and DMOZ sites.
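
The character-level input described in item 3 can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract does not say how URLs are preprocessed, so the scheme stripping and lowercasing in `preprocess`, the reserved indices `PAD` and `UNK`, the `max_len` of 64 and the context `window` of 2 are all assumptions chosen for illustration. The sketch only shows how a URL could become a fixed-length sequence of character indices, and how the (center, context) pairs that a Skip-gram model trains on could be formed from it.

```python
from typing import Dict, List, Tuple

PAD = 0  # index reserved for padding (assumption, not from the thesis)
UNK = 1  # index reserved for characters unseen at vocabulary-build time (assumption)


def preprocess(url: str) -> str:
    """Lowercase the URL and drop a leading http(s) scheme (hypothetical preprocessing)."""
    url = url.strip().lower()
    for scheme in ("http://", "https://"):
        if url.startswith(scheme):
            return url[len(scheme):]
    return url


def build_vocab(urls: List[str]) -> Dict[str, int]:
    """Map every character seen in the preprocessed URLs to an index >= 2."""
    chars = sorted({ch for u in urls for ch in preprocess(u)})
    return {ch: i + 2 for i, ch in enumerate(chars)}


def encode(url: str, vocab: Dict[str, int], max_len: int = 64) -> List[int]:
    """Convert a URL to character indices, truncated or right-padded to max_len."""
    ids = [vocab.get(ch, UNK) for ch in preprocess(url)][:max_len]
    return ids + [PAD] * (max_len - len(ids))


def skipgram_pairs(ids: List[int], window: int = 2) -> List[Tuple[int, int]]:
    """Build (center, context) index pairs within the window, skipping padding."""
    real = [i for i in ids if i != PAD]
    pairs = []
    for pos, center in enumerate(real):
        lo = max(0, pos - window)
        hi = min(len(real), pos + window + 1)
        for ctx in range(lo, hi):
            if ctx != pos:
                pairs.append((center, real[ctx]))
    return pairs
```

In a pipeline of the kind the abstract describes, index sequences like these would typically feed an embedding layer followed by a bi-directional LSTM, with the Skip-gram pairs used to train the character vectors beforehand; those stages are left out here because the abstract gives no architectural details.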
Keywords: Phishing, Feature Extraction, Machine Learning, Recurrent Neural Network, Long Short-Term Memory