Phishing Website Recognition Based On Model Fusion

Posted on: 2024-09-24
Degree: Master
Type: Thesis
Country: China
Candidate: J L Hu
Full Text: PDF
GTID: 2558307058480724
Subject: Applied Statistics
Abstract/Summary:
As users exchange more and more information over the web, web security faces serious challenges. Phishing attacks are one of them: attackers send fake links through various channels to lure users into logging in, then steal their private information, which ultimately leads to privacy disclosure and property loss. To create a safe network environment and protect Internet users from property loss, it is therefore particularly important to establish an effective phishing website identification model that flags suspicious websites in time. This thesis attempts to establish such a model based on a dataset for detecting and identifying phishing websites published on the Kaggle website in 2021. First, the research background, significance, and research status at home and abroad are analyzed, and the theory of the algorithms used in the thesis is introduced. A descriptive statistical analysis of the feature variables is then carried out, and the differences between legitimate and phishing websites in website URL structure, page content, and external query services are analyzed. Second, data preprocessing and feature engineering are carried out: outliers are removed, and the variance filtering method and the RF-RFE algorithm are used to screen the features, eliminating 17 redundant feature variables. Single classifiers are then constructed, and the models with the better prediction performance in single-model training are chosen as base models for subsequent use. Finally, the fusion model is constructed. The dataset is divided into a training set and a test set at a ratio of 7:3, and XGBoost, LightGBM, and Random Forest, which perform better as single models, are selected to construct the traditional Stacking model. Because this model performs poorly, the idea of the Stacking ensemble is used to improve the construction of the first-layer model: the dataset is divided into three parts according to data source, and base classifiers of the same kind are fused across the different source data to construct the XGBoost-Stacking model and the LightGBM-Stacking model, whose evaluation indicators are then compared and analyzed. The results show that the LightGBM-Stacking model has the best prediction performance. On this basis, Bayesian optimization is used to globally optimize the model's parameters, which further improves the prediction performance of the fusion model. Compared with the improved Stacking phishing website recognition model in the existing literature, the Bayesian-optimized LightGBM-Stacking model achieves a relative increase of 1.45% in recall, a relative decrease of 32.71% in FNR, and a relative increase of 2.43% in AUC. The model predicts better and is robust.
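The abstract reports relative changes in recall and FNR, so a small worked example may help show how the two relate. The sketch below assumes the standard definitions (recall = TP / (TP + FN), FNR = FN / (TP + FN) = 1 - recall) and uses hypothetical confusion-matrix counts; the function names and numbers are illustrative only and are not taken from the thesis.

```python
from typing import Tuple


def recall_and_fnr(tp: int, fn: int) -> Tuple[float, float]:
    """Return (recall, FNR) for the positive (phishing) class.

    recall = TP / (TP + FN), FNR = FN / (TP + FN), so FNR = 1 - recall.
    """
    positives = tp + fn
    if positives == 0:
        raise ValueError("no positive (phishing) samples")
    return tp / positives, fn / positives


def relative_change(new: float, old: float) -> float:
    """Relative change of `new` with respect to `old`, as a fraction."""
    return (new - old) / old


# Hypothetical counts, NOT results from the thesis:
# a baseline model and an improved model on 1000 phishing samples.
base_recall, base_fnr = recall_and_fnr(tp=950, fn=50)
new_recall, new_fnr = recall_and_fnr(tp=965, fn=35)

recall_gain = relative_change(new_recall, base_recall)
fnr_drop = relative_change(new_fnr, base_fnr)

print(f"relative change in recall: {recall_gain:+.2%}")
print(f"relative change in FNR:    {fnr_drop:+.2%}")
```

With these illustrative counts, a relative recall gain of under 2% corresponds to a 30% relative drop in FNR. When recall is already high, FNR is a small number, so a small relative gain in recall and a large relative drop in FNR can describe the same improvement.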
Keywords/Search Tags:Phishing Website Identification, Machine Learning, LightGBM-Stacking Fusion Model, Bayesian Optimization