Research On Unstructured Threat Intelligence Entity Extraction Method Based On Machine Learning

Posted on: 2023-10-14
Degree: Master
Type: Thesis
Country: China
Candidate: C Q Wang
Full Text: PDF
GTID: 2558306761487844
Subject: Engineering
Abstract/Summary:
Cyber threat intelligence is a key link in defending against identified cyber attackers and predicting possible attacks, and research on intelligence collection, analysis, and sharing technologies has become a hotspot in cyberspace security. Threat intelligence initially exists mostly as unstructured text, which cannot be used directly or processed in batches. Entity relation extraction methods from natural language processing are widely used to convert unstructured threat intelligence into structured form. However, because threat intelligence contains many specialized terms and different sources phrase their sentences differently, traditional natural language processing techniques cannot be applied directly to entity relation extraction in cyber threat intelligence text; doing so tends to give poor results and low recognition accuracy. To improve the automatic extraction of cyber threat intelligence, this thesis carries out the following research on threat intelligence dataset construction and unstructured entity relation extraction:

(1) Build a cyber threat intelligence entity-relation annotation dataset. Threat intelligence texts are obtained from authoritative cybersecurity websites to ensure the reliability of the intelligence. Following STIX, the international standard for threat intelligence sharing, an entity dataset and an entity relation dataset for the threat intelligence domain are constructed through three steps: text preprocessing, entity labeling, and relation labeling. The two datasets contain 65,885 entities and 4,423 entity relation pairs, respectively. Consistency testing shows that the dataset is highly consistent, so it can serve as a domain-specific dataset for the subsequent entity relation extraction research.

(2) Because traditional word embeddings use fixed word vectors and capture little feature information, they cannot identify the key entity information in threat intelligence well. A threat intelligence named entity recognition model based on ALBERT and BiLSTM-CRF is therefore proposed. The ALBERT language model is trained on the domain dataset and its parameters are tuned to obtain dynamic feature word vectors for the threat intelligence domain. The BiLSTM-CRF model then learns the contextual semantic information of the text and outputs the predicted label sequence. Compared with similar models in the public literature, the proposed model improves substantially in accuracy, recall, and F1 value, and especially in space utilization.

(3) To address the insufficient domain features and poor recognition accuracy of entity relation extraction models, a threat intelligence entity relation extraction model combining BiGRU and multi-head attention (BiGRU-MultiAtt) is proposed on top of the trained pre-trained language model. The model makes two improvements: first, it integrates word embeddings and positional embeddings in the pre-trained embedding layer; second, it learns contextual semantic features through BiGRU and adds a multi-head attention mechanism to strengthen its ability to learn keywords. Experiments on the threat intelligence entity relation dataset show that the proposed model is 4% higher than other models in recognition accuracy and can be applied effectively to entity relation extraction in the field of cyber threat intelligence.
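As an illustration of how a predicted label sequence from a tagger such as the one in (2) can be turned into entities and scored, the following is a minimal sketch, not the thesis's implementation. It assumes BIO-style tags and exact-match, entity-level scoring; the label names (`threat-actor`, `malware`), the helper names `bio_to_spans` and `entity_prf`, and the toy sequences are all hypothetical and do not come from the thesis dataset.

```python
from typing import List, Tuple

# (entity type, start token index, end token index exclusive)
Span = Tuple[str, int, int]


def bio_to_spans(labels: List[str]) -> List[Span]:
    """Convert a BIO label sequence into a list of entity spans."""
    spans: List[Span] = []
    start, etype = None, None
    # A trailing "O" sentinel flushes an entity that ends the sequence.
    for i, label in enumerate(labels + ["O"]):
        continues = label.startswith("I-") and etype == label[2:]
        if not continues and etype is not None:
            spans.append((etype, start, i))
            start, etype = None, None
        if label.startswith("B-"):
            start, etype = i, label[2:]
        # An "I-" tag with no matching open entity is ignored (treated as "O").
    return spans


def entity_prf(gold: List[Span], pred: List[Span]) -> Tuple[float, float, float]:
    """Exact-match, entity-level precision, recall and F1."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy example: the prediction truncates the two-token malware name.
gold_labels = ["B-threat-actor", "O", "B-malware", "I-malware", "O"]
pred_labels = ["B-threat-actor", "O", "B-malware", "O", "O"]

gold_spans = bio_to_spans(gold_labels)
pred_spans = bio_to_spans(pred_labels)
print(gold_spans)
print(pred_spans)
print(entity_prf(gold_spans, pred_spans))
```

Exact-match scoring is a strict choice: a prediction that covers only part of an entity counts as both a false positive and a false negative. The abstract does not say which scoring scheme the thesis uses, so this is only one plausible reading of "accuracy, recall and F1 value".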
Keywords/Search Tags:Cyber Threat Intelligence, BERT, Named Entity Recognition, Relation Extraction, BiGRU