Network security has become a matter of broad public concern. Over time, the threat posed by common attacks such as Distributed Denial of Service and SQL injection has continued to grow. An intrusion detection system (IDS) is one of the critical technologies for maintaining network security. Built on machine learning and deep learning techniques, an IDS can effectively detect abnormal network behaviour. However, data imbalance still degrades the classification performance of intrusion detection systems: the imbalanced data used by existing detection methods leads to overfitting, which reduces the detection rate on attack samples. To solve the imbalance problem in the intrusion detection domain, the paper proposes a clustering and instance hardness-based oversampling method. The method first preprocesses the input traffic data; for each minority sample, it calculates the proportion of majority-class samples among its nearest neighbours and takes that proportion as the hardness value; it then clusters the minority data. Secondly, the statistical optimal allocation method is used to calculate the amount of data to generate in each cluster. The ‘safe’ area within each cluster is then delimited using the hardness values. Finally, new samples are created by interpolation within that area. This method generates synthetic data and addresses the imbalance problem at the data level. The paper also proposes a classification method based on an ensemble of unsupervised learning techniques to address the imbalance problem in intrusion detection. Firstly, the input data is preprocessed. Secondly, a correlation distance matrix is constructed over the features of the processed data, and the features are divided into several groups by clustering. Thirdly, lightweight autoencoders are constructed; each adopts a three-layer neural network structure and uses the sigmoid function to activate the neurons of each layer. Each feature group is used to train one autoencoder. After that, the reconstruction errors of all autoencoders are calculated by reconstructing the input data. Finally, the Isolation Forest algorithm classifies the samples based on these errors. This algorithm-level method aims to counter the imbalance problem through unsupervised techniques such as autoencoders. Experimental results show that the proposed oversampling method has better generalization ability and classification accuracy than other sampling methods and can be applied well to intrusion detection. The results also show that the proposed classification method achieves a higher detection rate and consumes less time than other methods based on unsupervised learning.
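
The hardness and interpolation steps of the oversampling method can be illustrated with a minimal sketch. It is not the paper's implementation: it treats the minority data as a single cluster and omits the clustering and optimal allocation steps. The function names `knn_hardness` and `oversample_safe`, the neighbourhood size `k`, the hardness threshold of 0.5 used to delimit the ‘safe’ area, and the toy data are all assumptions made for illustration.

```python
import math
import random


def knn_hardness(minority, majority, k=5):
    """Hardness of each minority sample: the fraction of majority-class
    samples among its k nearest neighbours (brute-force Euclidean search)."""
    labelled = [(p, 0) for p in minority] + [(p, 1) for p in majority]
    hardness = []
    for i, x in enumerate(minority):
        # Distance to every other sample; index i is the sample itself.
        dists = [(math.dist(x, p), lab)
                 for j, (p, lab) in enumerate(labelled) if j != i]
        dists.sort(key=lambda t: t[0])
        nearest = dists[:k]
        hardness.append(sum(lab for _, lab in nearest) / len(nearest))
    return hardness


def oversample_safe(minority, hardness, n_new, threshold=0.5, seed=0):
    """Create n_new synthetic samples by linear interpolation between
    pairs of 'safe' minority samples (hardness <= threshold, an assumed rule)."""
    rng = random.Random(seed)
    safe = [x for x, h in zip(minority, hardness) if h <= threshold]
    if len(safe) < 2:
        return []  # not enough safe samples to interpolate between
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(safe, 2)
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic


# Toy data: four minority points far from the majority class and one
# minority point surrounded by majority points.
minority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (1.0, 1.0)]
majority = [(1.0, 1.1), (1.1, 1.0), (0.9, 1.0), (1.0, 0.9), (1.1, 1.1)]

hardness = knn_hardness(minority, majority, k=3)
synthetic = oversample_safe(minority, hardness, n_new=4)
```

In this toy setting the minority point at (1.0, 1.0) has only majority-class neighbours, so it falls outside the safe area, and every synthetic sample is interpolated between the four remaining minority points.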