Research On Malicious Web Page Detection Based On Cost-sensitive Learning

Posted on: 2022-07-20
Degree: Master
Type: Thesis
Country: China
Candidate: Q M Cai
GTID: 2518306560991729
Subject: Master of Engineering
Abstract/Summary:
With the advent of the big data era, the public faces a variety of network security issues, and malicious URLs, as a medium for Web attacks, increasingly threaten the security of users' information. Traditional methods for detecting malicious web pages, such as blacklist detection and signature matching, are exposing their intrinsic defects. Moreover, malicious web page detection itself faces the following challenges: insufficient feature coverage and complicated feature selection; loss of features such as URL word segmentation and contextual information; and the uneven distribution of normal and malicious samples in real detection environments. To address these challenges, this thesis detects malicious web pages through deep learning and cost-sensitive learning. The main contributions and innovations of this research are as follows:

(1) HTTP request parameters together with URL information are used as the raw data samples for feature extraction, and the corresponding data processing is carried out to resolve the difficulty of extracting features from simple URL data. An analysis of existing work shows that prior studies directly select one encoding approach or another for numerical vectorization. This research instead compares three encoding methods experimentally and chooses the best-performing one for character encoding. Doing so ensures the effectiveness of the subsequent detection model and verifies the feasibility of incorporating HTTP request parameters into malicious web page detection.

(2) A detection model based on a Convolutional Neural Network and Bidirectional Long Short-Term Memory is designed and constructed. First, a Convolutional Neural Network suited to URL detection is designed specifically for the characteristics of URL character input. To extract deep features of the data, the model uses two convolutional layers, and combinations of convolutional windows of different sizes are designed to extract local features of the data. The best combination of convolutional windows is selected through experimental comparison. Second, a Bidirectional Long Short-Term Memory network extracts the temporal features of the data from the pooling layer, and only the last unit of this network outputs the temporal features, which achieves a pooling effect. In contrast to most studies, which use a fully-connected layer to combine neural network models when extracting temporal features, this method not only extracts the contextual information of the data effectively but also avoids excessive model computation, thus ensuring detection efficiency.

(3) A neural network model based on a cost-sensitive strategy is designed and constructed. In actual applications, the number of malicious web pages is much smaller than the number of normal web pages. The common practice is nevertheless to train models on idealized data sets, and a detector trained this way may produce false-positive results in practice. This research therefore introduces a cost-sensitive strategy into the deep learning network model. The model assigns different penalty factors to data samples during the iterative process, improves the rules for assigning initial weights to data samples and normalizes them, and increases the weight of malicious samples in the overall error function. Together, these measures make the model focus more on hard-to-learn samples. The experimental results demonstrate that the improved detection model better addresses the problem of data imbalance and shows potential for generalization and high scalability.
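The cost-sensitive idea in point (3) can be illustrated with a small sketch. The abstract does not give the thesis's actual weighting rule, so the code below assumes a simple inverse-frequency scheme in which the rarer (malicious) class receives a larger penalty factor and the weights are normalized to average 1 over the samples. The function names `class_weights` and `weighted_bce` are hypothetical and are not taken from the thesis.

```python
import math

def class_weights(labels):
    """Inverse-frequency class weights (an assumed scheme, not the thesis's rule).

    Each class c gets weight n / (k * count_c), where n is the number of
    samples and k the number of classes, so the per-sample weights sum to n.
    """
    n = len(labels)
    counts = {c: labels.count(c) for c in set(labels)}
    k = len(counts)
    return {c: n / (k * counts[c]) for c in counts}

def weighted_bce(y_true, p_pred, weights, eps=1e-12):
    """Weighted binary cross-entropy, normalized by the total sample weight.

    y_true holds 0 (normal) or 1 (malicious); p_pred holds the predicted
    probability of the malicious class for each sample.
    """
    total, weight_sum = 0.0, 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        w = weights[y]
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
        weight_sum += w
    return total / weight_sum

# Imbalanced toy data: nine normal samples and one malicious sample.
labels = [0] * 9 + [1]
weights = class_weights(labels)

# Case A: the single malicious sample is missed (predicted 0.1).
missed_malicious = weighted_bce(labels, [0.1] * 10, weights)
# Case B: the malicious sample is caught, but one normal sample is flagged.
false_alarm = weighted_bce(labels, [0.9] + [0.1] * 8 + [0.9], weights)
```

Under this assumed weighting, missing the one malicious sample yields a larger loss than raising one false alarm on a normal sample, which is the asymmetry a cost-sensitive error function is meant to produce. In a deep learning framework the same effect is usually obtained by passing per-class weights to the loss function rather than computing it by hand.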
Keywords/Search Tags: Deep Learning, Malicious Web Page, URL Detection, Cost-sensitive Learning, Neural Networks