
Research On Malicious Domain Names Detection Method Based On Deep Learning

Posted on: 2022-05-05
Degree: Master
Type: Thesis
Country: China
Candidate: S H Zhang
GTID: 2518306515966499
Subject: Electronics and Communications Engineering
Abstract/Summary:
The Domain Name System (DNS) is a naming system for machine addresses on the internet. Taking IPv4 in the TCP/IP protocol suite as an example, an IP address consists of numbers separated by "." characters, which makes it hard to understand and remember. The internet therefore relies on DNS to map domain names to IP addresses. However, DNS itself has no means of defending against threats such as web page tampering, phishing, and Trojan implantation in web pages, so attackers often use domain names to implant vulnerabilities, backdoors, and malicious programs, seriously harming the network security of internet users. Accurate, real-time detection of malicious domain names is therefore an important safeguard for every network user. This thesis applies deep learning techniques to domain name detection. First, a malicious domain name detection model based on multi-feature selection is proposed. Next, a detection model based on Doc2vec and serially connected networks is proposed by combining natural language processing with neural networks. Finally, the Focal Loss function is introduced to construct the DLR-FL (Doc2vec-LSTM-RNN-Focal Loss) malicious domain name detection model. The specific research contents are as follows:

(1) In current malicious domain name detection based on machine learning classification algorithms, using fewer features makes detection faster but not accurate enough, while using more features makes detection slower and prone to overfitting. To address this problem, the first stage of this thesis extracts and sorts a large number of word-formation features from domain name URLs. A feature selection algorithm then computes a weight for each of the global features, and the 20 highest-weighted features are selected to form the key feature set. In addition, the Relief algorithm is improved to shorten the time needed for feature ranking and selection and to improve the real-time performance of the detection model. Finally, the key feature set is fed to a C5.0 classifier to separate legitimate from malicious domain names. The experimental results show that extracting the global features of a sample first and then ranking and selecting them keeps the feature set comprehensive, avoids overfitting, and improves detection accuracy. At the same time, the C5.0 decision tree keeps computational complexity low and speeds up detection while maintaining high binary classification accuracy. Combining the improved Relief algorithm with the C5.0 decision tree classifier thus achieves accurate and fast detection of malicious domain names.

(2) Machine learning detection methods require feature values to be extracted manually, and because the extraction method is relatively fixed, it is easy for attackers to circumvent and generalizes poorly. In the second stage of this thesis, the domain name text is therefore embedded into a vector space to obtain more comprehensive features, replacing manual feature extraction and saving time and space resources. First, the malicious domain name set is sorted and classified by domain generation algorithm (DGA) family. Second, ideas from natural language processing are introduced, and the Doc2vec algorithm is used to produce distributed vector representations of the domain name set. Then a bidirectional LSTM network and a bidirectional RNN are constructed and connected in series to extract deep features from the sample vectors. Finally, the Softmax function is used to classify domain names as legitimate or malicious. Experiments on public data sets show that combining the Doc2vec algorithm with the serial neural network strengthens feature extraction, maintains accurate detection results across different DGA domain name families, and solves the generalization problem of the first-stage model.

(3) During model training, the data set is imbalanced between easy-to-classify and hard-to-classify samples, so the loss contributed by hard-to-detect samples plays only a small role, which hurts detection accuracy. In the third stage of this thesis, the DLR-FL model is therefore constructed from the DLR model and the Focal Loss function to further improve detection performance. The data set is also enlarged, experiments are carried out on a variety of DGA domain name families, and the results are compared with related work of the same kind to verify the accuracy and robustness of the proposed model.
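
As an illustration of why Focal Loss suits the imbalance described in stage (3), the sketch below compares it with ordinary cross-entropy for a single sample. It uses the standard form FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The abstract does not state the thesis's hyperparameters, so gamma=2.0 and alpha=1.0 are assumed values, and the function names and example probabilities are hypothetical.

```python
import math


def cross_entropy(p_true):
    """Cross-entropy loss for one sample, given the predicted
    probability of its true class."""
    p = min(max(p_true, 1e-12), 1.0)  # clamp to avoid log(0)
    return -math.log(p)


def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss for one sample: cross-entropy scaled by
    alpha * (1 - p_true) ** gamma.

    gamma and alpha are assumed values for illustration only;
    they are not taken from the thesis.
    """
    p = min(max(p_true, 1e-12), 1.0)  # clamp to avoid log(0)
    return -alpha * (1.0 - p) ** gamma * math.log(p)


# Hypothetical predictions: one easy sample (confidently correct)
# and one hard sample (low probability on the true class).
easy_p, hard_p = 0.95, 0.30

ce_ratio = cross_entropy(hard_p) / cross_entropy(easy_p)
fl_ratio = focal_loss(hard_p) / focal_loss(easy_p)

print("hard/easy loss ratio under cross-entropy:", round(ce_ratio, 1))
print("hard/easy loss ratio under focal loss:  ", round(fl_ratio, 1))
```

The factor (1 - p_t)^gamma shrinks the loss of confidently classified samples much more than that of hard ones, so the hard-to-easy loss ratio grows and hard samples carry more weight in training. With gamma set to 0 and alpha set to 1, the expression reduces to plain cross-entropy.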
Keywords/Search Tags: Malicious domain name detection, Domain name word-formation features, Decision tree classifier, Natural language processing, Deep learning