| With the development of network technology, data such as text, images, and videos are growing explosively, and a large amount of network traffic data is also generated during data transmission. Using classification technology to efficiently manage, analyze, and predict these massive amounts of data can provide reliable references for behavioral decision-making and convenient personalized services. Discrete data has attracted attention in deep learning because it lies closer to knowledge-level representations and serves as a fixed form of algorithm input. This paper selects two types of discrete data with similar compositional structures, text and network traffic, for research on classification algorithms. For multi-class classification scenarios in the text field, the algorithm is improved on the basis of the Transformer, and high-accuracy text classification is achieved through multiple-representation extraction and decision fusion mechanisms. For scenarios with few labeled training samples in the field of network traffic, a Transformer-based pre-training task is designed to extract network traffic representations, and downstream fine-tuning is performed for different classification scenarios to improve the accuracy and stability of network traffic classification. In addition, a network traffic behavior representation analysis and detection system is developed to put the network traffic classification algorithm into practice. The research content is as follows: 1. A text classification model based on multiple-representation extraction and decision fusion is proposed. A text sequence is a typical form of discrete data, and text classification technology is widely used in practice. To address the low precision and instability of classification results caused by too many classification labels in multi-class scenarios, this paper adopts a hierarchical training method to reduce the number of classification
labels, which lessens the influence of too many labels on model training. Extracting multiple representations with stacked Transformers and multi-scale convolutions, and adopting decision fusion mechanisms to make the optimal behavior or semantic label selection, improves the stability and classification accuracy of the model on different classification targets. The algorithm has been validated on the customs cargo coding classification datasets HS-1 and HS-2, where its classification accuracy is significantly higher than that of the comparison models, providing customs personnel with highly accurate support for customs declaration decisions. 2. A network traffic classification algorithm based on a pre-trained Transformer is proposed. Network traffic messages are composed of character sequences and are therefore also a kind of discrete data. To address the low classification accuracy that arises when only a few labeled training samples are available, the pre-trained model approach from NLP is applied to network traffic. The traffic undergoes text-like preprocessing, a self-supervised task is built to train the model, and downstream fine-tuning is then performed for different classification scenarios. The algorithm is verified on four public datasets, in attack-behavior and protocol classification scenarios, demonstrating its correctness and effectiveness. 3. A network traffic behavior representation analysis and detection system is developed. Based on the algorithm from the second research point, a system integrating network traffic behavior representation analysis and abnormal-behavior detection is developed to address the problems of existing anomaly detection platforms: low detection accuracy, the inability to extract intermediate network traffic representation data, and complex interfaces. Users can not only analyze and extract network traffic representations through a simple, user-friendly interface but also run abnormal-behavior detection on
batch data, which meets their representation analysis and detection needs and provides a better user experience. |
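
The text-like preprocessing and self-supervised task mentioned in the second research point can be illustrated with a minimal sketch. The abstract does not specify the tokenization scheme or the pre-training objective, so everything here is an assumption made for illustration: packet bytes are turned into overlapping hex n-gram "words", and a BERT-style masked-token objective stands in for the self-supervised task. The names `bytes_to_tokens`, `mask_tokens`, and `MASK` are hypothetical and do not come from the paper.

```python
from __future__ import annotations

import random

# Hypothetical placeholder token; the paper does not name one.
MASK = "[MASK]"


def bytes_to_tokens(payload: bytes, n: int = 2) -> list[str]:
    """Turn raw packet bytes into overlapping hex n-grams, treated like words.

    Assumption: the paper only says "text-like preprocessing"; hex n-grams
    are one plausible way to do this, not necessarily the authors' way.
    """
    hex_bytes = [f"{b:02x}" for b in payload]
    if len(hex_bytes) < n:
        # Too short for a full n-gram: keep what is there as a single token.
        return ["".join(hex_bytes)] if hex_bytes else []
    return ["".join(hex_bytes[i:i + n]) for i in range(len(hex_bytes) - n + 1)]


def mask_tokens(
    tokens: list[str],
    mask_prob: float = 0.15,
    rng: random.Random | None = None,
) -> tuple[list[str], list[str | None]]:
    """Build one masked-token training example.

    Returns (masked_tokens, labels). labels[i] holds the original token at
    each masked position and None elsewhere, so a model would be trained to
    recover only the masked tokens.
    """
    rng = rng or random.Random()
    masked: list[str] = []
    labels: list[str | None] = []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


if __name__ == "__main__":
    tokens = bytes_to_tokens(b"\x16\x03\x01\x02\x00")
    masked, labels = mask_tokens(tokens, mask_prob=0.3, rng=random.Random(0))
    print(tokens)
    print(masked)
    print(labels)
```

In such a setup, the masked sequences would be fed to a Transformer encoder during pre-training, and the same tokenizer would be reused when fine-tuning on a small labeled set for attack-behavior or protocol classification.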