Research On The Feature Selection Techniques Based On SVM For Network Data

Posted on: 2014-02-27; Degree: Doctor; Type: Dissertation
Country: China; Candidate: K Dai; Full Text: PDF
GTID: 1108330482979106; Subject: Communication and Information System
Abstract/Summary:
In recent years, network data recognition has become an important research field, with applications in numerous domains including intrusion detection, document classification, and social network analysis. Feature selection techniques, which aim to eliminate irrelevant features while maintaining or even improving the performance of the learning algorithm, are a hot topic in information science and play a vital role in constructing network data recognition systems. Existing feature selection algorithms for network data usually select the best subset from a known feature set according to a specific evaluation criterion, so an original feature set of the network data is required before feature selection begins. For example, Moore presented a set of 248 flow-based features. This approach has two main problems. On the one hand, algorithms based on the statistical features of network traffic can be used only for coarse classification of network data, not for fine classification. On the other hand, with the rapid development of network technology, a large amount of network data has emerged that cannot be interpreted using open standard protocol specifications. In this case, it is difficult to acquire such an original feature set, and even if one can be obtained, it fails to describe the diverse unknown protocols.
Therefore, feature selection algorithms that can automatically select the most informative features from network data are urgently needed. To address these problems, this dissertation focuses on the autonomous learning ability of the feature selection process and the extensibility of the selected features, and studies the theory and application of feature selection based on the classification theory of support vector machines (SVM) in network data recognition systems. The main research results are as follows:
1. For linearly separable network data with known specifications and labels, a supervised, automated feature selection algorithm based on SVM for multi-class classification is proposed to overcome the lack of autonomous learning ability in existing supervised feature selection algorithms. It can also be used for fine classification. The proposed algorithm takes the original content of the network data as input, combines 1-norm and 2-norm penalties, and automatically selects the most important features that contribute to classification. Because the loss function and the 1-norm regularization term are non-differentiable, an efficient solver based on the alternating direction method of multipliers (ADMM) is developed. Theoretical results are also given, covering adaptive adjustment of the penalty parameters, the number of selected features, the number of training samples, and the test error. The performance of the proposed algorithm is examined in simulations and on three real network traffic datasets and three open datasets using 5-fold cross-validation.
Compared with other supervised feature selection algorithms, the features selected by the proposed algorithm are consistent with the semantic relationships in the network data, and their validity is confirmed by the higher prediction accuracy of the classifier trained on them.
2. For linearly separable network data with known specifications and only a small number of labeled samples, a semi-supervised, automated feature selection algorithm based on SVM is proposed to overcome the deficiency of most existing semi-supervised feature selection algorithms, which require an original feature set before feature selection. The proposed algorithm adopts a clipped symmetric hinge function and automatically selects important features from the network data by solving a mixed integer programming problem. To solve this problem, an efficient algorithm based on ADMM is developed. Theoretical results are given, covering convergence, adaptive adjustment of the penalty parameters, and computational complexity. The proposed algorithm requires less manual involvement than supervised methods. Its performance is examined in simulations and on three real network traffic datasets and six semi-supervised benchmark datasets using 5-fold cross-validation. The experimental results show that the selected features are consistent with the semantic relationships in the network data and that the classifier trained on them achieves higher prediction accuracy, which demonstrates the high quality of the selected features. The proposed algorithm is also applicable to linearly separable network data with unknown specifications, in which case it is equivalent to an unsupervised feature selection algorithm.
The performance of this unsupervised feature selection algorithm is also examined in simulations and on three real network traffic datasets and six benchmark datasets using 5-fold cross-validation.
3. For non-linearly separable network data, a supervised feature selection algorithm based on SVM for classification and an unsupervised feature selection algorithm based on SVM for clustering are proposed. Both map the non-linear network data into a high-dimensional feature space using a non-linear mapping function, which yields a high-quality feature set while retaining a strong learning ability. Selecting features in the high-dimensional space requires an analytical expression of the mapping function, which is obtained by function fitting for a specific kernel function. The performance of both proposed algorithms is examined in simulations and on three real network traffic datasets and benchmark datasets using 5-fold cross-validation. The experimental results show that the features selected by the proposed algorithms are consistent with the semantic relationships in the network data and yield higher prediction accuracy, which demonstrates the high quality of the selected features. However, both proposed algorithms are worse than the compared algorithms in terms of time complexity.
4. Using the proposed supervised and unsupervised feature selection algorithms, an automatic network data recognition system based on SVM is proposed. The proposed system automatically extracts the relevant features from the original data content and then separates network data with unknown specifications from network data with known specifications, which are then classified into specific categories. Furthermore, the proposed system can guide field segmentation and the understanding of unknown protocols.
The performance of the proposed system is evaluated on simulated data, three real network traffic datasets, and one open dataset, with good results.
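To make the first result more concrete, the following is a minimal sketch of how a hinge-loss SVM with combined 1-norm and 2-norm penalties can be solved with ADMM. It is not the dissertation's algorithm: it assumes a binary problem with labels in {-1, +1} and no intercept (the abstract describes a multi-class method), and the function names, the splitting into `a` and `z`, the penalty scaling `rho / n`, and all parameter values are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, k):
    """Proximal operator of k * ||.||_1; sets small entries exactly to zero."""
    return np.sign(v) * np.maximum(np.abs(v) - k, 0.0)

def hinge_prox(t, c):
    """Proximal operator of c * max(0, a), applied elementwise at t."""
    return np.where(t > c, t - c, np.where(t < 0.0, t, 0.0))

def elastic_net_svm_admm(X, y, lam1=0.1, lam2=0.01, rho=1.0, n_iter=1000):
    """Minimise mean hinge loss + lam1*||w||_1 + (lam2/2)*||w||_2^2.

    Assumed splitting: a = 1 - A w carries the hinge loss and z = w carries
    the 1-norm, so every ADMM subproblem has a closed form.
    """
    n, d = X.shape
    A = y[:, None] * X                  # row i is y_i * x_i; margins are A @ w
    rho_a = rho / n                     # assumed penalty for the margin constraint
    a, u = np.zeros(n), np.zeros(n)     # a and its scaled dual variable
    z, v = np.zeros(d), np.zeros(d)     # sparse copy of w and its scaled dual
    M = (lam2 + rho) * np.eye(d) + rho_a * (A.T @ A)
    for _ in range(n_iter):
        # w-update: ridge-like linear system
        w = np.linalg.solve(M, rho_a * (A.T @ (1.0 - a - u)) + rho * (z - v))
        Aw = A @ w
        # a-update: prox of the (non-differentiable) hinge loss
        a = hinge_prox(1.0 - Aw - u, 1.0 / (n * rho_a))
        # z-update: prox of the (non-differentiable) 1-norm
        z = soft_threshold(w + v, lam1 / rho)
        # dual updates on the two constraint residuals
        u += a - (1.0 - Aw)
        v += w - z
    return z

# Synthetic data (an assumption for illustration): only features 0 and 1 matter.
rng = np.random.default_rng(0)
X = rng.standard_normal((400, 10))
y = np.sign(X[:, 0] - X[:, 1])
z = elastic_net_svm_admm(X, y)
selected = np.flatnonzero(z)            # features with non-zero weight
accuracy = np.mean(np.sign(X @ z) == y)
print("selected features:", selected, "training accuracy:", accuracy)
```

The features are read off the sparse variable `z` rather than `w`, because the soft-thresholding step produces exact zeros, which is what turns the classifier into a feature selector.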
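The abstract does not define the clipped symmetric hinge function used in the second result. As a hedged illustration only, the sketch below shows the standard symmetric hinge max(0, 1 - |f|) commonly applied to unlabeled samples in semi-supervised SVMs, together with a capped variant. The cap and its value are assumptions made here for illustration and may differ from the definition used in the dissertation.

```python
import numpy as np

def symmetric_hinge(f):
    """Standard symmetric hinge: penalises decision values with |f| < 1,
    regardless of sign, so unlabeled points are pushed away from the boundary."""
    return np.maximum(0.0, 1.0 - np.abs(f))

def clipped_symmetric_hinge(f, cap=0.5):
    """Hypothetical clipped variant: the loss is capped at `cap`.
    The cap is an illustrative assumption, not taken from the source."""
    return np.minimum(cap, symmetric_hinge(f))

decision_values = np.array([0.0, 0.5, -0.5, 1.0, 2.0])
print(symmetric_hinge(decision_values))
print(clipped_symmetric_hinge(decision_values))
```

Because the loss depends only on |f|, no label is needed to evaluate it, which is why a function of this kind suits the unlabeled part of a semi-supervised objective.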
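The third result depends on having an explicit expression for the kernel's mapping function. The dissertation obtains it by function fitting, which the abstract does not detail. The sketch below instead uses a case where the map is known exactly, the homogeneous degree-2 polynomial kernel k(x, y) = (x . y)^2, to show why an explicit map makes feature selection possible in the mapped space. The function names and sample vectors are illustrative assumptions.

```python
import numpy as np

def poly2_feature_map(x):
    """Explicit map for k(x, y) = (x . y)^2: all pairwise products x_i * x_j."""
    return np.outer(x, x).ravel()

def mapped_index_to_pair(k, d):
    """Translate a mapped coordinate k back to the pair (i, j) of original features."""
    return divmod(k, d)

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, -1.0, 2.0])

kernel_value = (x @ y) ** 2                                   # kernel trick
explicit_value = poly2_feature_map(x) @ poly2_feature_map(y)  # explicit inner product
print(kernel_value, explicit_value)
print(mapped_index_to_pair(5, d=3))
```

With the kernel trick alone the mapped coordinates are never materialised, so there is nothing to select. Once the map is explicit, a sparse weight on a mapped coordinate can be traced back to the original features it was built from.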
Keywords/Search Tags: Artificial Intelligence, Support Vector Machines, Network Data, Feature Selection, Supervised Learning, Semi-supervised Learning, Unsupervised Learning