With the continuous enrichment of network application types and the explosive growth of network traffic, flexibly adjusting the network to meet the needs of diverse users has become an urgent problem in the “Internet+” era. Classifying and identifying the data traffic on every network link is a prerequisite for such control, and knowing how traffic is distributed across the whole network helps upper-layer network management applications deploy policies that match current network conditions. However, existing identification techniques face several difficult problems. First, on imbalanced data sets, machine learning algorithms tend to be biased toward the majority classes, which leads to a high overall error rate. Second, features with high class discrimination and low redundancy must be selected to build the training sample set, so that the time and space overhead of model training is reduced. To address these issues, this thesis studies network traffic classification and mobile traffic app identification. The main work is divided into the following two parts.

In the first part, an improved data balancing algorithm based on random forest is proposed to classify network application traffic.
1. To address the class bias caused by imbalanced sample sets, this thesis proposes an improved data balancing algorithm based on sparsity weighting. When sampling new samples, the improved algorithm takes full account of the distribution characteristics of the minority samples and of the fuzzy conditions at the class boundaries, which avoids the negative impact that a loss of information richness would have on model training. In addition, new minority samples are synthesized by linear interpolation between minority samples and their neighbors, which avoids the over-fitting that results from directly copying minority samples.
2. When selecting the optimal feature subset, information gain and application-category correlation are measured jointly to obtain an efficient comprehensive feature evaluation index, which reduces the performance overhead of the system.

In the second part, for the C4.5 decision tree based mobile app traffic identification method, this thesis introduces optimizations using hundreds of thousands of mobile app traffic records collected with Wireshark.
1. Packet length and packet inter-arrival time are used as the objects of feature extraction. Instead of the TCP session, the burst is introduced as the basic unit of traffic collection during data preprocessing. This characterizes mobile app behavior at a fine granularity and supports online classification of network traffic.
2. When selecting the optimal feature subset, a Pearson feature dimensionality reduction method based on category-related mutual information is used to reduce the impact that entropy changes in the target variable have on classification performance, which improves the robustness of the classification model and reduces model complexity.

Combining the above performance optimizations, the network traffic classification and mobile traffic app identification framework built in this thesis is lightweight, highly discriminative, and scalable, and is suitable for real network application scenarios.
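
The linear interpolation between minority samples and their neighbors mentioned in the first part can be illustrated with a minimal sketch. This is plain SMOTE-style interpolation only, not the sparsity-weighted algorithm proposed in the thesis; the function name `interpolate_minority` and the parameters `k` and `seed` are assumptions made for illustration.

```python
import numpy as np


def interpolate_minority(X_min, n_new, k=5, seed=0):
    """Synthesize n_new points by interpolating between minority samples
    and randomly chosen members of their k nearest minority neighbors.

    Hypothetical helper: it shows the interpolation idea only and does
    not apply any sparsity weighting.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)

    # Pairwise Euclidean distances; a sample is never its own neighbor.
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Pick a base sample and one of its neighbors for each new point.
    base = rng.integers(0, n, size=n_new)
    pick = rng.integers(0, k, size=n_new)
    nb = neighbours[base, pick]

    # New point = base + gap * (neighbor - base), with gap drawn from [0, 1).
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[nb] - X_min[base])
```

Because every synthetic point lies on a segment between two existing minority samples, no sample is an exact copy of another, which is the property the abstract credits with reducing over-fitting.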
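
The use of bursts as the unit of traffic collection in the second part can be sketched as follows. The abstract does not define how a burst is delimited, so this sketch assumes a common convention: a burst is a run of packets whose inter-arrival gaps stay below a threshold. The names `split_into_bursts` and `burst_features`, the default `gap_threshold` of 1.0 second, and the chosen summary features are all hypothetical.

```python
from statistics import mean


def split_into_bursts(packets, gap_threshold=1.0):
    """Group (timestamp_seconds, length_bytes) tuples, sorted by time,
    into bursts. A new burst starts when the gap to the previous packet
    exceeds gap_threshold (an assumed definition, not from the thesis).
    """
    bursts = []
    current = []
    for ts, length in packets:
        if current and ts - current[-1][0] > gap_threshold:
            bursts.append(current)
            current = []
        current.append((ts, length))
    if current:
        bursts.append(current)
    return bursts


def burst_features(burst):
    """Summarize one burst by packet length and inter-arrival time."""
    lengths = [length for _, length in burst]
    times = [ts for ts, _ in burst]
    gaps = [b - a for a, b in zip(times, times[1:])]
    return {
        "n_packets": len(burst),
        "total_bytes": sum(lengths),
        "mean_length": mean(lengths),
        "mean_gap": mean(gaps) if gaps else 0.0,
    }
```

Each burst can be summarized as soon as the gap that ends it is observed, without waiting for a TCP session to close, which is one plausible reason a burst-based unit suits online classification.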