With the rapid development of network communication technology, network security problems have emerged in an endless stream, and more and more organizations are suffering from advanced persistent threats. In response to these attacks, security defenders use threat extraction techniques to quickly identify the tactics and techniques described in unstructured threat intelligence, thereby speeding up defense. Considering that a single piece of unstructured threat intelligence often involves multiple tactics and techniques, this dissertation models threat extraction as a multi-label text classification task and treats the extracted tactics and techniques as labels. First, this dissertation constructs a threat intelligence dataset based on ATT&CK. Second, to support subsequent research, a statistical analysis is carried out on the threat intelligence dataset and on general datasets in the field of multi-label text classification. The problems found in the datasets are summarized as follows. On the one hand, the label distribution exhibits a long tail, which causes sample imbalance. Drawing on a correlation analysis between labels, this dissertation uses label correlation to transfer the rich semantic information learned from head labels to tail labels, thereby compensating for the poor classification performance on tail labels. On the other hand, unstructured text is verbose, especially in the threat intelligence dataset, which increases the difficulty of threat extraction. This dissertation uses the semantic correlation between labels and text to highlight words that carry category information, thereby reducing the interference of semantically irrelevant words. Accordingly, this dissertation proposes solutions based on mining the correlation among labels and the correlation between text and labels, and then designs and implements a threat extraction system. The main contents of this dissertation are as follows:

(1) In order to effectively
mine the correlation between labels and extract label-discriminative information from the text, a multi-label text classification method based on label combination and attention fusion is proposed. Considering that the co-occurrence relationship between labels clearly reflects their relevance, and based on the idea that similar labels tend to appear together as combinations in similar texts, a pre-training enhancement strategy based on label combination is designed. In the pre-training stage, the encoder is trained on multiple sampled texts that are similar or dissimilar in terms of their label combinations, thereby capturing the correlation between labels and the semantic overlap between similar texts. In the training stage, global information and fine-grained semantic information are obtained through self-attention and through label attention enhanced by multi-layer dilated convolution, respectively; the two kinds of information are then adaptively fused and fed into a multi-layer perceptron for multi-label prediction. Experiments are conducted on the threat intelligence dataset and two general multi-label text classification datasets, and the results show that the method achieves a significant improvement in micro-F1.

(2) In order to strengthen correlation learning between text and labels and to speed up model inference, a multi-label text classification method based on joint embedding and two-stream interaction is proposed. First, text and labels are mapped into the same space by the joint embedding module, which captures the relevance between text and labels as well as the correlation between labels. Second, a text internal association module is introduced to capture long-range dependencies between characters in the text through the self-attention mechanism and positional encoding, and a mixed depthwise convolution feed-forward network is further used to integrate local information and obtain the global text representation. Then, the
text and label association learning module is designed. Relying on the cross-attention mechanism, the label embeddings obtained by the joint embedding module are used as label queries to interact with the text representation, adaptively extracting the fine-grained dependencies between each label and the text. Finally, a weighted fusion strategy supervised by multiple loss functions is used to fuse the outputs of the two modules and further optimize the prediction results. Experimental results on multiple datasets show that the proposed method can effectively extract the correlation between text and labels, highlight keywords, and reduce the interference of semantically irrelevant words while improving inference speed and reducing the number of parameters, thereby improving label prediction performance.

(3) In response to the functional requirement of automatically extracting threats from unstructured intelligence, a threat extraction system based on the Flask framework is designed and implemented. After the effectiveness of the above two methods has been verified, they are applied to the threat extraction system. The system integrates and implements two main functional modules, threat extraction and fast extraction, which load the corresponding models to perform automatic threat extraction. The extraction results can be converted into machine-readable threat intelligence and stored locally. In addition, users can adjust the prediction results and upload them to the server, so that the training set and the model can be iteratively updated.

In summary, in order to achieve efficient threat extraction, this dissertation proposes solutions from two aspects: dataset construction and method design. Extensive experimental results show that the proposed methods effectively improve the performance and efficiency of threat extraction. The threat extraction system built on the two methods shows that the work in this dissertation has both academic value and practical application value.
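
As a minimal illustration of the multi-label formulation and the micro-F1 indicator mentioned above, the following sketch may help. It is not the dissertation's implementation: the four ATT&CK technique IDs are an arbitrary illustrative subset, the gold and predicted label sets are toy values, and the helper names (`to_multi_hot`, `micro_f1`, `cooccurrence_counts`) are hypothetical. It shows how each intelligence report maps to a multi-hot label vector, how micro-F1 pools true positives, false positives, and false negatives over all (report, label) pairs, and how label co-occurrence can be counted as a simple signal of label correlation.

```python
from collections import Counter
from itertools import combinations

# Illustrative label vocabulary: a small, assumed subset of ATT&CK technique IDs.
LABELS = ["T1027", "T1059", "T1071", "T1566"]


def to_multi_hot(label_set, labels=LABELS):
    """Encode a set of labels as a 0/1 vector over the label vocabulary."""
    return [1 if label in label_set else 0 for label in labels]


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over every (sample, label) pair."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def cooccurrence_counts(label_sets):
    """Count how often each unordered pair of labels appears on the same text."""
    counts = Counter()
    for label_set in label_sets:
        for pair in combinations(sorted(label_set), 2):
            counts[pair] += 1
    return counts


# Toy data (made up for illustration): gold and predicted label sets per report.
gold = [{"T1566", "T1059"}, {"T1059", "T1027"}, {"T1071"}]
pred = [{"T1566", "T1059"}, {"T1059"}, {"T1071", "T1027"}]

y_true = [to_multi_hot(s) for s in gold]
y_pred = [to_multi_hot(s) for s in pred]

print("micro-F1:", micro_f1(y_true, y_pred))
print("co-occurrence:", dict(cooccurrence_counts(gold)))
```

Because micro-averaging pools counts over all labels, frequent head labels dominate the score, which is one reason long-tailed label distributions such as the one described above deserve separate attention.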