Malicious Webpage Detection Method Based On Cost-sensitive Online Active Learning

Posted on: 2021-05-28 | Degree: Master | Type: Thesis
Country: China | Candidate: B G Chen | Full Text: PDF
GTID: 2428330602968837 | Subject: Computer Science and Technology
Abstract/Summary:
With the development of Internet technology, people enjoy the convenience of the Internet but are also targeted by cyber attackers because of weak security awareness and website vulnerabilities. Attacks such as phishing websites and Trojan horses increasingly threaten users' personal privacy and property, and the growing complexity of attack methods poses a serious challenge to detection. Existing approaches have three shortcomings: URL lexical features are insufficient, optimizing for accuracy cannot cope with class imbalance, and the life cycle of malicious webpages keeps getting shorter. To address these problems, this thesis builds on word segmentation to extract lexical features that incorporate context and position information, improves the objective function of online active learning, and proposes a malicious webpage detection method based on cost-sensitive online active learning. The main research contents and innovations are as follows:

(1) Existing URL lexical feature extraction does not cover context and position information. On top of word segmentation based on URL domain knowledge, convolution is used to extract lexical features that capture the context and position of each word. For the domain-knowledge-based segmentation, the differences between segmenting URL text and processing ordinary natural language are analyzed, and the edit distance is modified according to the visual similarity between characters so that it can measure how similar a domain name is to a brand name. The segmented text is converted to word vectors with word2vec, and a total of 400 convolution kernels of 4 different heights then turn the word vectors into feature vectors, adding lexical context and position features.

(2) URL-related features can be invalidated by URL shortening services. To compensate, webpage-related features are extracted in addition to URL-related features. JavaScript features are extracted through structural analysis, internal script analysis, and external script analysis, and HTML features are extracted separately for two scenarios: phishing and drive-by downloads (webpages injected with Trojans).

(3) Conventional supervised learning builds a model by optimizing accuracy. In malicious webpage detection, however, the class distribution is extremely imbalanced: simply predicting every webpage as benign already achieves very high accuracy, so accuracy should not be the optimization objective. To account for the different misclassification costs that arise from class imbalance, cost-sensitive metrics are used as both the optimization objective and the evaluation metric. A learning algorithm is derived from these metrics, and the online learning model is combined with the closed-form solution of the cost-sensitive objective to meet the requirements of real-time malicious webpage detection. In addition, active learning is used to actively query webpage labels for model training.
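
The visually weighted edit distance mentioned in (1) can be illustrated roughly as follows. This is only a minimal sketch, not the thesis's formulation: the abstract does not give the character table or the weights, so the `CONFUSABLE` pairs, the `VISUAL_SUB_COST` value, and the function names are all assumptions made for illustration. The idea shown is a standard Levenshtein distance in which substituting one look-alike character for another is cheaper than an ordinary substitution, so a spoofed domain stays "close" to the brand it imitates.

```python
# Hypothetical table of visually confusable character pairs. The abstract
# does not list the thesis's actual pairs or weights; these are assumptions.
CONFUSABLE = {
    frozenset(pair)
    for pair in [("0", "o"), ("1", "l"), ("1", "i"), ("i", "l"), ("5", "s")]
}
VISUAL_SUB_COST = 0.2  # assumed cost, lower than the default cost of 1.0


def sub_cost(a: str, b: str) -> float:
    """Substitution cost: 0 for equal characters, reduced for look-alikes."""
    if a == b:
        return 0.0
    if frozenset((a, b)) in CONFUSABLE:
        return VISUAL_SUB_COST
    return 1.0


def visual_edit_distance(s: str, t: str) -> float:
    """Levenshtein distance with cheaper substitutions for look-alike characters."""
    s, t = s.lower(), t.lower()
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [float(i)]
        for j, b in enumerate(t, start=1):
            curr.append(
                min(
                    prev[j] + 1.0,                 # delete a from s
                    curr[j - 1] + 1.0,             # insert b into s
                    prev[j - 1] + sub_cost(a, b),  # substitute a with b
                )
            )
        prev = curr
    return prev[-1]


def brand_similarity(domain: str, brand: str) -> float:
    """Distance normalized by the longer length; 1.0 means identical strings."""
    longest = max(len(domain), len(brand))
    if longest == 0:
        return 1.0
    return 1.0 - visual_edit_distance(domain, brand) / longest
```

With these assumed weights, `paypa1` scores as more similar to `paypal` than `paypax` does, which is the behavior a brand-spoofing feature would need.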
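
The combination of cost-sensitive online learning and active label querying described in (3) might look roughly like the sketch below. It is a generic illustration, not the algorithm derived in the thesis, whose objective and closed-form solution are not given in the abstract. The sketch assumes a linear model, a margin-based query probability `delta / (delta + |score|)`, and a passive-aggressive style closed-form step whose cap is larger for the malicious class; the class name, parameter names, and default values are all hypothetical.

```python
import random


class CostSensitiveActiveLearner:
    """Linear online learner that queries labels selectively (illustrative only)."""

    def __init__(self, dim, delta=1.0, c=1.0, pos_weight=5.0, seed=0):
        self.w = [0.0] * dim          # weight vector of the linear model
        self.delta = delta            # larger delta means more label queries
        self.c = c                    # base cap on the update step size
        self.pos_weight = pos_weight  # assumed extra cost of missing a malicious page
        self.rng = random.Random(seed)
        self.queries = 0              # number of labels requested so far

    def score(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def predict(self, x):
        # +1 = malicious, -1 = benign
        return 1 if self.score(x) >= 0 else -1

    def wants_label(self, x):
        # Query more often when the example lies close to the decision boundary.
        return self.rng.random() < self.delta / (self.delta + abs(self.score(x)))

    def update(self, x, y):
        # Hinge loss with a closed-form step; the cap depends on the class cost.
        loss = max(0.0, 1.0 - y * self.score(x))
        norm_sq = sum(xi * xi for xi in x)
        if loss == 0.0 or norm_sq == 0.0:
            return
        cap = self.c * (self.pos_weight if y == 1 else 1.0)
        tau = min(cap, loss / norm_sq)
        self.w = [wi + tau * y * xi for wi, xi in zip(self.w, x)]

    def observe(self, x, oracle):
        """Predict first, then train only if the learner decides to ask for a label."""
        y_hat = self.predict(x)
        if self.wants_label(x):
            self.queries += 1
            self.update(x, oracle(x))
        return y_hat
```

In a stream setting, `observe` would be called once per incoming webpage feature vector, with `oracle` standing in for whatever labeling source is available (for example a human analyst). Comparing `queries` with the number of pages seen gives the labeling cost saved by active learning.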
Keywords/Search Tags: malicious web page, URL segmentation, cost-sensitive indicators, online learning, active learning