With the development of the internet and the growing number of internet users, network attacks such as trojan injection, phishing, and distributed attacks keep emerging because information is exchanged at high speed and high frequency. These attacks seriously threaten the privacy of individual users, the health of the network environment, and the security of national information assets. Many web attacks are carried out by spreading malicious URLs, and this paper studies how to detect them. Blacklist mechanisms can only identify malicious URLs that have already been detected and cannot predict newly appearing, unmarked ones. To address this problem, this paper performs statistical analysis on a large number of URLs and proposes a feature space for malicious URL detection with a high detection rate. Detection experiments with machine learning and deep learning algorithms show that the proposed 34-dimensional feature set, which includes features such as time and consonant ratio, distinguishes malicious URLs well, with a detection accuracy of up to 99.5%. A comparative analysis of the feature sets shows that 15 features, such as time, maximum subpath length, the probability sum of a URL's tuples in the negative and positive data sets, and the proportion of the longest string and of different characters in the domain name, have been unused or rarely used in previous studies yet play a key role in detection. Manually designing feature rules introduces irrelevant, redundant, and noisy features. To address this problem, this paper proposes a method for discovering a comprehensive feature space. The method uses machine learning algorithms such as random forest, J48, and Bayesian classifiers to select a high-accuracy, wide-spectrum feature space based on several feature selection algorithms, including information gain, information gain ratio, and correlation-based feature selection. Experimental results show that the feature space extracted by this method contributes well to malicious URL detection: the detection accuracy reaches 99.4%, and the average accuracy across multiple classifiers reaches 98.6%, which is 0.4% higher than with the full feature set, while the dimension of the feature space is reduced by 55.9%. In addition, mainstream feature extraction algorithms face difficulties in URL detection and recognition, such as the difficulty of manual rule design and the poor timeliness of rule updates. In view of these difficulties, this paper designs a URLs-encoder and combines it with three convolutional neural network structures to extract URL features automatically. The method constructs the URLs-encoder by counting n-gram (n=1) characters, encodes each URL as a matrix, and then initializes the convolutional neural network through pre-training, so that URL features are extracted automatically. The feature extraction model is then verified and analyzed with respect to multiple factors. Experiments show that the proposed feature extraction method, which combines URL encoding with a convolutional neural network, effectively extracts features from benign and malicious URLs. The extracted features discriminate well between the two classes: the classification accuracy of multiple classifiers exceeds 97% and reaches up to 99.2%.
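The character-level (n=1) encoding step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the encoder used in the paper: the abstract does not specify the matrix layout, so the sketch assumes a one-hot layout with one row per character and one column per URL position. The names `build_vocab` and `encode_url` and the parameters `max_size` and `max_len` are hypothetical.

```python
from collections import Counter


def build_vocab(urls, max_size=64):
    """Map the most frequent characters (1-grams) in a URL corpus to row indices."""
    counts = Counter(ch for url in urls for ch in url)
    # Row 0 is reserved for characters outside the vocabulary.
    # Ties in frequency are broken alphabetically so the mapping is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {ch: i + 1 for i, (ch, _) in enumerate(ranked[:max_size - 1])}


def encode_url(url, vocab, max_len=32):
    """Encode a URL as a (len(vocab) + 1) x max_len one-hot matrix (assumed layout)."""
    rows = len(vocab) + 1
    matrix = [[0] * max_len for _ in range(rows)]
    for col, ch in enumerate(url[:max_len]):  # URLs longer than max_len are truncated
        matrix[vocab.get(ch, 0)][col] = 1     # unknown characters are mapped to row 0
    return matrix                             # columns past the URL end stay zero (padding)


# Hypothetical toy corpus, used only to build the character vocabulary.
urls = ["http://example.com/a", "http://example.org/login"]
vocab = build_vocab(urls)
m = encode_url("http://example.com/a", vocab)
```

A matrix of this shape could then be passed to a convolutional network as a single-channel input. Fixing `max_len` keeps every input the same size, at the cost of truncating long URLs.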