Network technology now permeates every aspect of people's lives, and China has formed one of the largest and most vibrant digital societies in the world. At the same time, a variety of network threats leave China's network security situation far from optimistic, which makes cyber security particularly important. In recent years, the rapid development of AI has provided new technical support for cyber security, and many analysis models have been deployed on network facilities such as firewalls and intrusion detection systems (IDS). However, current work remains at the stage of complex analysis of high-dimensional data or raw traffic data. Most studies focus only on the final result and pay little attention to the underlying patterns embedded in network traffic behavior. Using the analysis model as a "black box" does not help improve the reliability of analysis models, nor does it help analysts improve their manual analysis skills with the aid of AI.

The background of our research is the wide application of AI in various fields, especially network traffic analysis based on machine learning. In this paper, we focus on the intrinsic mechanism of network traffic identification and detection and aim to explore the universal laws implied in network traffic analysis. The study takes four research directions as its starting point: the simplest feature scheme of network traffic, the relationship between bypass-information and content, the interpretability of analysis models, and the concrete characterization scheme in low-dimensional space. Four application scenarios of network traffic analysis are selected, one for each direction. Specifically, the main contributions of this paper are as follows.

(1) To address the shortcomings of complete-flow-based methods and the question of whether deep parsing of packet payloads is necessary, an identification method based on packet-sequence and byte-distribution is investigated in the context of application traffic identification. The purpose is to explore the simplest feature scheme of network traffic and to analyze the intrinsic basis of the task. The experimental results suggest that, under certain conditions, packet-sequence and byte-distribution constitute the simplest feature scheme for network traffic identification tasks, without capturing the complete flow or deeply parsing network protocol fields.

(2) To overcome the difficulties of content analysis for encrypted video traffic, we take encrypted DASH video traffic analysis as an example and investigate a fingerprint construction scheme based on higher-order Markov chains, together with unsupervised analysis, including thresholding and clustering based on Levenshtein distance, to mine the intrinsic relationship between the timing features and the content of a video stream. The experiments show that content information is not only contained in the payloads but is also inevitably leaked through the timing information, which enables content analysis of encrypted video traffic based on titles. Consequently, even when network traffic is encrypted, the bypass-information still contains a large amount of valuable content-related information that can be used for network behavior analysis.

(3) To address the lack of interpretation in current network traffic analysis models, an interpretable encrypted malicious traffic detection framework called DEV-ETA is proposed. This framework introduces interpretable machine learning into traffic analysis models for the first time and enhances the credibility of the model by explaining its detection results. Experiments show that backward interpretation methods, such as LIME, SHAP, and MSS, can be used to explain encrypted malicious traffic detection models, and the interpretation results can basically pass the final validation. In DEV-ETA, network traffic detection forms a complete closed loop of "detection-explanation-validation", which avoids the existing problem of "emphasizing results but not interpretation".

(4) A new concept, Flow Spectrum, is proposed for the first time to solve the problem of concrete characterization of network behavior. The Flow Spectrum transforms each network flow into a one-dimensional spectrum; that is, the current problem of analyzing high-dimensional network traffic data is transformed into the process of comparing Flow Spectra in one-dimensional space, which can significantly reduce the analysis complexity. By studying the characterization of network behavior in the low-dimensional domain, we search for the intrinsic indexical principles of flows that reflect network behavior. On the NSL-KDD dataset, we generated Flow Spectra for different attack types based on a semi-supervised autoencoder and conducted detection experiments and an analysis of the characterization effect. The experimental results provide an initial demonstration of the feasibility of the Flow Spectrum idea.
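
To make the distance-based matching mentioned in contribution (2) more concrete, the following minimal Python sketch computes the Levenshtein distance between two symbol sequences and accepts the closest stored fingerprint only if its distance falls within a threshold. It is an illustration under stated assumptions, not the method evaluated in this work: the function names, the integer-symbol encoding of segment timing or size, the example titles, and the threshold value are all hypothetical, and the higher-order Markov chain fingerprint construction is not reproduced here.

```python
from typing import Dict, Optional, Sequence, Tuple


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Edit distance between two sequences (insert, delete, substitute cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer sequence, keep rows short
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def match_title(
    observed: Sequence[int],
    fingerprints: Dict[str, Sequence[int]],
    threshold: int,
) -> Optional[Tuple[str, int]]:
    """Return (title, distance) of the nearest fingerprint, or None if the
    nearest one is farther away than `threshold` (hypothetical decision rule)."""
    best: Optional[Tuple[str, int]] = None
    for title, fingerprint in fingerprints.items():
        distance = levenshtein(observed, fingerprint)
        if best is None or distance < best[1]:
            best = (title, distance)
    if best is not None and best[1] <= threshold:
        return best
    return None


if __name__ == "__main__":
    # Hypothetical quantized segment sequences for two video titles.
    fingerprints = {
        "title_a": [3, 5, 5, 2, 7],
        "title_b": [1, 1, 4, 4, 9],
    }
    observed = [3, 5, 4, 2, 7]  # one symbol differs from title_a
    print(match_title(observed, fingerprints, threshold=2))
```

The threshold turns nearest-neighbor matching into an open-set decision: an observed sequence that is far from every stored fingerprint is reported as unknown instead of being forced onto the closest title. The same pairwise distances could also feed a clustering step, which this sketch does not attempt.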