
An Improved Random Forest Algorithm And Its Application On Intrusion Detection

Posted on: 2022-02-24
Degree: Master
Type: Thesis
Country: China
Candidate: K Jiang
GTID: 2518306476490594
Subject: Communication and Information System
Abstract/Summary:
Intrusion detection (ID) is one of the effective means of ensuring network security. To improve the detection rate of intrusion detection systems and reduce their false alarm rate, more and more machine learning algorithms are being applied to the task. The random forest algorithm is used in the firewall components of various key network equipment because of its simple principle, accurate classification, and strong generalization. However, network message data is highly imbalanced and has a high-dimensional feature set, which greatly reduces the classification performance of the random forest algorithm. In addition, because the algorithm needs to build a large number of decision trees, modeling takes too long. In response to these problems, this thesis proposes several improvements to the random forest algorithm and, on that basis, designs an intelligent firewall system suitable for home gateways. The main research contents of the thesis are as follows:

(1) To address the high imbalance and high-dimensional feature set of network message data, a random forest algorithm based on mixed sampling and feature pre-ranking, MS-FPR-RF (Mixed Sampling and Feature Pre-ranking Random Forest), is proposed. First, the improved algorithm divides the data into majority-class and minority-class samples and performs a boundary judgment on each minority-class sample; a sample judged to be a border minority-class sample is sampled multiple times. Second, the feature pre-ranking method is used to qualitatively sort the features by classification ability, and the features with weaker classification ability are deleted so that higher-precision decision trees can be trained. After many high-precision decision trees have been constructed, the double failure metric DF is used as the distance between trees, and the k-means++ algorithm is used to select decision trees with strong independence to form the final random forest model. Experiments show that the improved random forest algorithm has higher classification accuracy than traditional algorithms on multiple data sets. In particular, on the CSE-CIC-IDS2018 data set, whose data categories are unbalanced, the correct rate of the improved algorithm reached 81.3% and the accuracy rate reached 93.4%.

(2) To address the excessive volume of network packet data and the long modeling time of random forest, this thesis uses the Spark distributed framework to parallelize the algorithm. In a comparative experiment on 180,000 records, 5-node parallel processing reduced the running time from the original 100 seconds to 22 seconds.

(3) Based on the improved random forest algorithm and the Spark distributed framework, this thesis designs and builds an intrusion detection system to verify the protection effect of the improved algorithm. The experimental data comes from collected log files of the home gateway and from scanning messages sent by scanning software. The system is divided into two main parts: algorithm modeling and intrusion detection. Experimental results show that the improved algorithm raises the detection accuracy from 82.2% to 86.5% and reduces the detection time for 68,341 records from 183 seconds to 42 seconds.
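The abstract uses a double failure metric DF as the distance between trees. In the ensemble-diversity literature this is usually called the double-fault measure: for two classifiers, it is the fraction of samples that both misclassify. The sketch below computes the pairwise DF matrix with NumPy. It is only an illustration under assumptions: the function name `double_fault_matrix` and the toy data are hypothetical, and the abstract does not say how the thesis converts DF into a distance for k-means++, so the sketch stops at the matrix.

```python
import numpy as np


def double_fault_matrix(predictions, y_true):
    """Pairwise double-fault values for a set of classifiers.

    predictions: array of shape (n_trees, n_samples), one row of predicted
                 labels per tree (hypothetical input layout).
    y_true:      array of shape (n_samples,), the true labels.
    Returns an (n_trees, n_trees) matrix whose entry [i, j] is the fraction
    of samples misclassified by both tree i and tree j.
    """
    predictions = np.asarray(predictions)
    y_true = np.asarray(y_true)
    # 1.0 where a tree is wrong on a sample, 0.0 where it is right.
    wrong = (predictions != y_true).astype(float)
    # The dot product of two error rows counts samples both trees get wrong.
    return wrong @ wrong.T / y_true.shape[0]


# Toy example: three trees, five samples (made-up labels).
y_true = np.array([0, 1, 1, 0, 1])
predictions = np.array([
    [0, 1, 0, 0, 1],  # wrong on sample 2
    [0, 1, 0, 1, 1],  # wrong on samples 2 and 3
    [1, 1, 1, 0, 1],  # wrong on sample 0
])
df = double_fault_matrix(predictions, y_true)
print(df)
```

A low off-diagonal value means two trees rarely fail on the same sample; here trees 0 and 2 never fail together, while trees 0 and 1 share one error out of five samples.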
Keywords/Search Tags: Cyber security, Intrusion detection, Random Forest Algorithm, Spark distributed