Malicious Web Site Recognition Based On Page Information

Posted on: 2020-10-13  Degree: Master  Type: Thesis
Country: China  Candidate: M Xiao  Full Text: PDF
GTID: 2428330575487543  Subject: Master of Applied Statistics
Abstract/Summary:
With the development of network technology, the threshold for accessing the Internet keeps falling and the number of Internet users keeps growing. The hidden dangers are growing as well. Through the Internet, cybercriminals carry out financial information theft, fraudulent purchases, spam and other activities, and malicious websites give them ample room to operate. The identification of malicious websites is therefore of great practical value.

In recent years, the identification of malicious websites has achieved certain results. Traditional identification relies mainly on blacklists and heuristic rules, and the shortcomings of both are obvious. A blacklist can only identify malicious URLs that it already contains. Heuristic rules place heavy demands on whoever designs them and require in-depth research, and such rules are difficult to find and difficult to change. Later work used machine learning to identify malicious URLs, but most of it focused on phishing URLs.

The page information of a malicious website consists mainly of text information and non-text information. Current research in machine learning and deep learning has made certain achievements in text recognition and classification. This thesis draws on that earlier experience and combines it with practice, applying machine learning results to the identification of malicious websites.

The main work of this thesis is:
(1) Malicious website content is analyzed from two aspects, text information and non-text information, which improves recognition across multiple categories of malicious websites.
(2) For text information, attention is paid to text cleaning, so that effective information is retained to the maximum extent and the best analysis result is achieved with the minimum of manual annotation. A combination of Word Embedding and TF-IDF is used to process text features and improve the efficiency of the model. A support vector machine model is used to identify and classify malicious URLs, and the reliability of the classification output is evaluated using the three highest class probabilities.
(3) Non-text information mainly targets a class of websites that present their content through pictures and video. Skin-color detection is used to identify objectionable images on the page, and thereby to recognize such websites.
(4) The accuracy of the recognition model based on malicious website page information can reach 99%. On website data accessed by actual users, the recognition accuracy for malicious websites reaches about 95%.
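
The skin-color detection step in item (3) can be illustrated with a minimal sketch. The abstract does not say which detector the thesis uses, so everything below is an assumption made for illustration: the sketch applies a widely used explicit RGB threshold rule for skin pixels, and the names `is_skin`, `skin_ratio`, `looks_suspicious` and the 0.4 page threshold are hypothetical.

```python
from typing import Iterable, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each channel in 0..255


def is_skin(pixel: Pixel) -> bool:
    """Explicit RGB threshold rule for skin under ordinary lighting.

    Assumption: this rule stands in for whatever detector the thesis uses.
    """
    r, g, b = pixel
    return (
        r > 95 and g > 40 and b > 20
        and max(r, g, b) - min(r, g, b) > 15  # enough spread between channels
        and abs(r - g) > 15                   # red clearly separated from green
        and r > g and r > b                   # red is the dominant channel
    )


def skin_ratio(pixels: Iterable[Pixel]) -> float:
    """Fraction of pixels classified as skin; 0.0 for an empty image."""
    total = 0
    skin = 0
    for p in pixels:
        total += 1
        skin += is_skin(p)
    return skin / total if total else 0.0


def looks_suspicious(pixels: Iterable[Pixel], threshold: float = 0.4) -> bool:
    """Flag an image whose skin ratio reaches a hypothetical threshold."""
    return skin_ratio(pixels) >= threshold


# Toy "image": three skin-toned pixels and one blue pixel.
image = [(220, 170, 140)] * 3 + [(30, 60, 200)]
print(skin_ratio(image), looks_suspicious(image))
```

A fixed per-pixel rule like this is cheap but sensitive to lighting and to skin-colored backgrounds, so in practice the ratio would more plausibly serve as one feature among several than as a decision on its own.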
Keywords/Search Tags: Malicious web address, Page information, Feature processing, Support vector machine, Skin color detection