A Neural Network Phishing Detection Research Based On Decision Tree And Optimal Feature Selection

Posted on:2021-01-06
Degree:Master
Type:Thesis
Country:China
Candidate:Y Y Ju
Full Text:PDF
GTID:2428330629980177
Subject:Computer Science and Technology
Abstract/Summary:
With the continuous development of mobile devices and social networking technologies, phishing has become an increasingly serious threat to online networks. Phishing attackers use social engineering channels such as e-mail and SMS to lure users to phishing websites, where they steal the visitor's username, account password, and sensitive financial information, causing serious economic losses to the victims. Effective methods and techniques for detecting and preventing phishing attacks are therefore urgently needed. Traditional phishing website detection methods focus mainly on the basic mechanism of phishing and ignore emerging attack technologies, target environments, and the latest phishing websites.

Because of their strong ability to learn from massive datasets and their high accuracy in data classification, neural networks are commonly used for detecting and preventing phishing attacks. However, public datasets contain many noise points, such as duplicate data points and data points with negative or useless features, which can cause the neural network classifier to over-fit during training. An over-fitted classifier usually cannot detect phishing websites accurately. To alleviate these problems, this thesis proposes DT-ANN, a neural network phishing detection model based on a decision tree and optimal feature selection. In this model, the traditional K-medoids clustering algorithm is first improved to remove duplicate sample points from the public datasets. Then, an optimal feature selection algorithm based on a newly defined feature evaluation index (f_Value), a decision tree, and local search is designed to prune the negative and useless features, which relieves the over-fitting problem when training the neural network classifier. Finally, the parameters are tuned to construct the optimal structure of the neural network classifier, which is trained on the selected optimal sensitive features. Experimental results demonstrate that the proposed DT-ANN outperforms many existing methods.

The main work of the thesis is as follows:

(1) Uses the improved K-medoids clustering algorithm to refine the phishing dataset. In machine learning based phishing detection systems, public datasets are usually used to train the underlying classifiers before they are used to test for or detect phishing attacks. However, many public datasets are flooded with noise points or duplicate points. These points degrade the performance of the classifiers or even cause them to over-fit. In this thesis, the traditional K-medoids clustering algorithm, based on the Euclidean distance, is improved by selecting the clustering centers (medoids) incrementally rather than randomly. The improved K-medoids clustering algorithm generates a refined set of training instances that represents the original dataset well.

(2) Proposes a new feature evaluation index (f_Value). In many machine learning based phishing detection systems, sensitive features that represent the target URLs and their related websites are extracted to train the underlying classifiers. Different features affect the performance of the classifier differently. Positive features improve the performance of the classifier, whereas useless and negative features seriously degrade the detection accuracy of the final classifiers. To evaluate the impact of different features on phishing detection, this thesis proposes f_Value, a new feature evaluation index defined on the basis of the Gini coefficient and the decision tree.

(3) Designs a new feature selection algorithm. Generally speaking, the availability of an adequate number of features and the approach used to choose the best features are the main reasons for the good performance of machine learning classifiers. However, excessive features enlarge the final classifiers and make their computation more complex. Furthermore, the collected features may contain useless and negative features that harm the performance of the classifiers. In this thesis, a new optimal feature selection algorithm based on the newly defined f_Value index, the decision tree, and local search is designed to select the optimal feature set for the underlying classifier.
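
The incremental medoid selection mentioned in item (1) could look roughly like the Python sketch below. The abstract does not state the actual selection rule, so the farthest-first rule used here is an assumption, as are the function names `incremental_medoids` and `refine` and the toy points. The sketch covers only the seeding step and skips the usual swap phase of K-medoids.

```python
import math


def incremental_medoids(points, k):
    """Pick k medoid indices one at a time instead of at random.

    Assumed rule (not taken from the thesis): start with the most central
    point, then repeatedly add the point that lies farthest (Euclidean
    distance) from its nearest already-chosen medoid.
    """
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")

    # First medoid: the point with the smallest total distance to all points.
    first = min(range(n), key=lambda i: sum(math.dist(points[i], q) for q in points))
    medoids = [first]

    # Remaining medoids: farthest-first traversal.
    while len(medoids) < k:
        nxt = max(
            range(n),
            key=lambda i: min(math.dist(points[i], points[m]) for m in medoids),
        )
        medoids.append(nxt)
    return medoids


def refine(points, k):
    """Return one representative point (the medoid) per cluster."""
    return [points[m] for m in incremental_medoids(points, k)]


# Toy data: two pairs of near-duplicates and one isolated point.
points = [(0.0, 0.0), (0.0, 0.1), (5.0, 5.0), (5.0, 5.1), (9.0, 0.0)]
print(refine(points, 3))
```

On the toy data, each near-duplicate pair collapses to a single representative, which is the kind of refinement the abstract describes. How the thesis chooses `k` and handles noise points is not specified there.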
Keywords/Search Tags:Phishing detection, K-medoids clustering, feature selection, neural network
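
For the feature evaluation and selection steps, the abstract names the Gini coefficient and a decision tree but does not give the f_Value formula. The Python sketch below therefore uses a stand-in score, the Gini impurity decrease of a single-feature split, to show how useless features might be pruned. The function names, the `min_gain` cutoff, and the toy data are hypothetical, and the local search step is omitted.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_gain(feature_values, labels):
    """Impurity decrease when splitting on each distinct value of one feature.

    This is a stand-in for f_Value, whose definition the abstract does not give.
    """
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(g) / n * gini(g) for g in groups.values())
    return gini(labels) - weighted


def rank_features(rows, labels):
    """Return (feature_index, gain) pairs, highest gain first."""
    n_features = len(rows[0])
    scores = [
        (j, gini_gain([row[j] for row in rows], labels)) for j in range(n_features)
    ]
    return sorted(scores, key=lambda s: s[1], reverse=True)


def select_features(rows, labels, min_gain=1e-9):
    """Keep the features whose gain exceeds a (hypothetical) cutoff."""
    return [j for j, gain in rank_features(rows, labels) if gain > min_gain]


# Toy data: feature 0 separates the classes, feature 1 is constant,
# feature 2 is unrelated to the label.
rows = [(1, 0, 1), (1, 0, 0), (0, 0, 1), (0, 0, 0)]
labels = [1, 1, 0, 0]
print(select_features(rows, labels))
```

A score computed per feature in isolation cannot detect features that help only in combination, which is presumably one reason the thesis pairs its index with a decision tree and local search.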