Malicious Web Sites Detection Based On Data Mining Algorithms

Posted on: 2018-06-23
Degree: Master
Type: Thesis
Country: China
Candidate: Y Xue
Full Text: PDF
GTID: 2348330518996895
Subject: Cryptography
Abstract/Summary:
With the development of the Internet, website security has attracted wide public attention. Frequent attacks by malicious websites cause serious damage to users and pose a threat to personal and even national security. Identifying the features of malicious websites, and determining whether a given site is malicious, is therefore of great importance.

At present, many scholars at home and abroad have proposed feature selection methods based on host features and lexical features, but their accuracy and efficiency are not high. To address these problems, this thesis introduces the concept of a vulnerable sites list and proposes a new feature extraction scheme based on weighted distance. It also improves the KNN model by means of an improved FCM (fuzzy c-means) algorithm, which raises the efficiency of the model. The research work is as follows:

Data collection: data from normal and malicious websites is crawled, cleaned, and standardized, and then stored in a MySQL database.

Feature extraction: in contrast to the common concepts of a site whitelist and a site blacklist, this thesis collects sites that are easily attacked and puts forward the concept of vulnerable sites. It is well known that a malicious website usually makes some degree of change to a normal site. Different weights are assigned according to the type of change, which gives the concept of weighted distance. The smallest weighted distance between the input URL and the URLs in the vulnerable sites list is then used as a new feature.

Method improvement: the KNN algorithm and the FCM algorithm are improved. Because the initial cluster centers of FCM are uncertain, the algorithm easily falls into a local optimum. This thesis proposes a coordinate density method to determine the initial cluster centers. The initial number of clusters for FCM is set based on the value of K and the size of the data set. Finally, the input data is classified by computing the distance between its features and the FCM cluster centers.

Model validation: the LR model, the J48 model, and the improved KNN model are run in WEKA. Data mining methods are used to compare the accuracy obtained with the original features against the accuracy obtained with the new feature added, and the classification results improve. The proposed method is also compared with other methods in the literature, and it achieves higher accuracy.
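
As an illustration of the weighted distance feature described above, the following is a minimal sketch and not the thesis's actual implementation. It assumes the "change types" are character insertion, deletion, and substitution, and that the distance is a weighted edit distance; the abstract does not specify either point. The function names, the default weights, and the example host names are all hypothetical.

```python
def weighted_distance(a, b, w_ins=1.0, w_del=1.0, w_sub=1.0):
    """Weighted edit distance for turning string a into string b.

    Assumption: each change type (insert, delete, substitute) has its
    own cost. The weights here are placeholders, not values from the thesis.
    """
    # prev[j] = cost of turning the empty prefix of a into b[:j]
    prev = [j * w_ins for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        # Turning a[:i] into the empty string takes i deletions.
        curr = [i * w_del]
        for j, cb in enumerate(b, start=1):
            cost_sub = prev[j - 1] + (0.0 if ca == cb else w_sub)
            cost_del = prev[j] + w_del
            cost_ins = curr[j - 1] + w_ins
            curr.append(min(cost_sub, cost_del, cost_ins))
        prev = curr
    return prev[-1]


def nearest_weighted_distance(host, vulnerable_hosts, **weights):
    """Smallest weighted distance from host to any entry in a non-empty list."""
    return min(weighted_distance(host, v, **weights) for v in vulnerable_hosts)


if __name__ == "__main__":
    # Hypothetical vulnerable sites list and a look-alike host name.
    vulnerable = ["examplebank.com", "example-shop.org"]
    print(nearest_weighted_distance("examp1ebank.com", vulnerable, w_sub=0.5))
```

A small value indicates that the input closely imitates a listed site, so in this sketch the returned number would be appended to the other lexical and host features before classification.
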
Keywords/Search Tags: Phishing URL detection, URL weighted distance, KNN, FCM