Machine Learning Based Malicious Webpage Analysis

Posted on: 2020-01-13
Degree: Master
Type: Thesis
Country: China
Candidate: W Y Zhou
GTID: 2428330623963758
Subject: Electronic and communication engineering
Abstract/Summary:
With the rapid development of the Internet and the surge in the number of Internet users, cyber security issues have become increasingly serious. Many attackers profit illegally by attacking webpages, so it is necessary to study and analyze malicious webpages. Among the many types of malicious webpage, hidden hyperlinks and phishing webpages are especially common. Hidden hyperlinks not only bring profits to the gray industry, but also seriously damage the credibility of authoritative websites. Phishing webpages defraud Internet users of their private information and money. This paper therefore focuses on studying and analyzing hidden hyperlinks and phishing webpages. With the arrival of the big data era, the number of webpages has reached tens of thousands, and traditional malicious webpage detection methods are no longer applicable at this scale. This paper therefore introduces machine learning into the field of malicious webpage detection.

For hidden hyperlinks, this paper implements a machine learning based hidden hyperlink detection model. The method combines three different types of hidden hyperlink features: the text feature of the hidden hyperlink, the domain feature of the hidden hyperlink, and the hidden attribute feature of the hidden hyperlink. On this basis, three classification algorithms (CART, GBDT and random forest) are used to construct the classifier. The model is evaluated on public datasets under a real business scenario. The experimental results show that, compared with the CART classification tree and GBDT, random forest has the strongest generalization ability for hidden hyperlink detection, with moderate time consumption.

This paper also proposes a machine learning based phishing webpage detection model. The model combines three different types of phishing webpage features: URL features, text features and HTML features. To realize the model, three classification algorithms (isolation forest, random forest and XGBoost) are used to construct the classifier. To simulate a real-life business situation, the model is evaluated on a severely imbalanced dataset. The experimental results show that, compared with isolation forest and random forest, XGBoost not only has the best generalization ability and stability, but also takes the least time.

The hidden hyperlink detection model and the phishing webpage detection model proposed in this paper can be applied to developing malicious webpage detection systems in industrial settings. The results also show that applying machine learning to malicious webpage research can achieve strong results and provides a new approach for the future analysis of malicious webpages.
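The hidden attribute feature described in the abstract can be illustrated with a minimal sketch. The abstract does not list the attributes the thesis actually inspects, so everything below is an assumption made for illustration: the style patterns in `HIDING_PATTERNS`, the `hidden_attr` flag, and the names `HiddenLinkExtractor` and `link_features` are all hypothetical. The sketch uses only the Python standard library and marks each link with binary flags that a tree-based classifier such as CART or random forest could consume alongside text and domain features.

```python
import re
from html.parser import HTMLParser

# Hypothetical hiding patterns; the thesis's real feature set is not given.
HIDING_PATTERNS = {
    "display_none": re.compile(r"display\s*:\s*none", re.I),
    "visibility_hidden": re.compile(r"visibility\s*:\s*hidden", re.I),
    "zero_font": re.compile(r"font-size\s*:\s*0(?:px|pt|em|%)?\s*(?:;|$)", re.I),
    "offscreen": re.compile(r"(?:left|top|text-indent)\s*:\s*-\d{3,}", re.I),
}
FEATURE_NAMES = list(HIDING_PATTERNS) + ["hidden_attr"]

# Void elements never get an end tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class HiddenLinkExtractor(HTMLParser):
    """Collects <a href> links with the hiding flags they carry or inherit."""

    def __init__(self):
        super().__init__()
        self._stack = []      # (tag, flags inherited from ancestors + own)
        self._current = None  # link dict currently receiving text
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attr = dict(attrs)
        style = attr.get("style") or ""
        # Start from the parent's flags: a link inside a hidden <div> is hidden.
        flags = set(self._stack[-1][1]) if self._stack else set()
        flags.update(n for n, pat in HIDING_PATTERNS.items() if pat.search(style))
        if "hidden" in attr:
            flags.add("hidden_attr")
        self._stack.append((tag, flags))
        if tag == "a" and attr.get("href"):
            self._current = {"href": attr["href"], "text": "", "flags": flags}
            self.links.append(self._current)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; tolerate sloppy HTML.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break
        if tag == "a":
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data


def link_features(html):
    """Return one dict per link: href, anchor text, and 0/1 hiding flags."""
    parser = HiddenLinkExtractor()
    parser.feed(html)
    parser.close()
    rows = []
    for link in parser.links:
        row = {"href": link["href"], "text": link["text"].strip()}
        for name in FEATURE_NAMES:
            row[name] = int(name in link["flags"])
        rows.append(row)
    return rows
```

Calling `link_features(page_html)` on a page's markup yields a list of rows whose flag columns can be stacked into a feature matrix. Inline styles are only one place where hiding can occur; external stylesheets and script-driven hiding are outside the scope of this sketch.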
Keywords/Search Tags: hidden hyperlink, phishing webpage, machine learning, random forest, gradient boosted decision tree