
Research on Malware Infection Prediction Based on Big Data and Ensemble Learning

Posted on: 2022-06-07 | Degree: Master | Type: Thesis
Country: China | Candidate: Y J Zhang | Full Text: PDF
GTID: 2518306566499604 | Subject: Software engineering
Abstract/Summary:
With the advent of the information age, the number of PCs has grown rapidly, and so has the harm caused by malicious software. Malware, including viruses, Trojan horses, ransomware, and adware, poses a serious security threat to personal and commercial computers, and the damage it causes increases year by year. Timely and effective prediction of malware infection is therefore particularly important. This thesis uses an open-source data set from the Kaggle platform to perform binary classification on whether a computer is infected with malware, in order to explore how ensemble learning algorithms can be applied to the prediction of multi-dimensional data related to network security. The full data mining process, covering data storage, integration, preprocessing, and algorithm processing, is carried out within the Hadoop framework. First, to make the data mining process efficient, the data set is stored in a distributed file system, data integration and preprocessing are completed in the Hive data warehouse, and each attribute column and the prediction label are analyzed to determine the feasibility of prediction. Second, the attributes that remain after data cleaning are used to construct the original features, and derived features are then built from the original features, which increases the number of features in the data set to 181. Third, three ensemble learning algorithms, XGBoost, LightGBM, and Random Forest, are each used to build a single model for malware infection prediction, and these models are evaluated and tuned. The AUC of each of the three single models exceeds that of the traditional decision tree algorithm by more than 0.1. Finally, models are built following the idea of model fusion. Logistic regression and random forest are each used as the meta-learner for Stacking fusion on top of the three single models. Compared with the linear fusion model based on the inverse-variance method, Stacking increases the AUC by 0.02, and within the Stacking model, random forest as the meta-learner predicts better than logistic regression. This thesis makes use of the effective information in the original data set to establish a binary classification model, and achieves an effective prediction of whether a computer is infected with malware.
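As an illustration of the linear fusion baseline mentioned in the abstract, the following is a minimal sketch of inverse-variance weighting together with a pairwise AUC computation. The abstract does not give the exact formula used in the thesis, so this sketch assumes one common formulation: each model's weight is proportional to the reciprocal of the variance of its errors on a validation set. The function names and the toy scores are hypothetical and are not taken from the thesis or its data set.

```python
from statistics import pvariance


def auc(labels, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def inverse_variance_weights(labels, model_scores):
    """Assumed formulation: weight_i is proportional to 1 / Var(score_i - label).

    Raises ZeroDivisionError if a model's errors have zero variance.
    """
    inverse = []
    for scores in model_scores:
        errors = [s - y for y, s in zip(labels, scores)]
        inverse.append(1.0 / pvariance(errors))
    total = sum(inverse)
    return [v / total for v in inverse]  # normalised so the weights sum to 1


def fuse(weights, model_scores):
    """Weighted average of the models' predicted probabilities, sample by sample."""
    n_samples = len(model_scores[0])
    return [
        sum(w * scores[i] for w, scores in zip(weights, model_scores))
        for i in range(n_samples)
    ]


# Hypothetical validation labels and predicted probabilities from three
# single models (stand-ins for XGBoost, LightGBM, and Random Forest outputs).
y_val = [1, 0, 1, 0, 1, 0]
scores_a = [0.9, 0.2, 0.7, 0.4, 0.6, 0.1]
scores_b = [0.6, 0.5, 0.8, 0.3, 0.4, 0.45]
scores_c = [0.8, 0.3, 0.55, 0.6, 0.7, 0.2]

models = [scores_a, scores_b, scores_c]
weights = inverse_variance_weights(y_val, models)
fused = fuse(weights, models)

print("weights:", [round(w, 3) for w in weights])
print("fused AUC:", round(auc(y_val, fused), 3))
```

In a Stacking setup, by contrast, the three score columns would be fed as input features to a meta-learner (logistic regression or random forest in the thesis) rather than combined with fixed weights.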
Keywords/Search Tags:Malware infection prediction, Feature engineering, Ensemble learning, XGBoost, LightGBM, Random forest, Model fusion, Hadoop