
Research On Named Entity Recognition Method For Network Security Domain

Posted on: 2024-03-31
Degree: Master
Type: Thesis
Country: China
Candidate: D L Li
Full Text: PDF
GTID: 2558307097471554
Subject: Computer technology
Abstract/Summary:
Named entity recognition is an important part of knowledge extraction and the first task in building knowledge graphs. How to quickly and accurately identify and extract useful information from large amounts of text has been an active research topic in recent years. With the advent of the big data era, network intrusion, virus infection, and other network attacks have become more frequent and have seriously affected the security of computer use. Without network security, there is no national security. To ensure the security of cyberspace, the state monitors the network in real time through various technologies, which generates a large amount of network security data.

This paper studies the application of deep-learning neural network models to entity recognition for cybersecurity vulnerability events: text is first converted to word embeddings, then encoded, and finally decoded with conditional random fields. A named entity recognition neural network model that incorporates multiple sources of information about Chinese characters is proposed, and, because the domain lacks a named entity recognition corpus, a cybersecurity entity recognition corpus is constructed. The details of the study are as follows:

(1) Constructing a corpus for entity recognition in the cybersecurity domain. To address the lack of a public cybersecurity entity recognition corpus, information from the national security vulnerability database is collected as the text source of the corpus, so that the data source is authentic and valid. The collected data covers vulnerability information from the past five years for the operating system, application, database, web application, network device, and other modules, to ensure that the corpus is comprehensive and current. The corpus goes through two stages: a pre-labeling stage and a final labeling stage. Experts in the field of cybersecurity develop the annotation rules and specifications, build annotation tools, and train the annotators. The final corpus contains 400,000 words, annotated with the BIO scheme and split into training, validation, and test sets in the ratio 6:2:2.

(2) Proposing a neural network model for cybersecurity named entity recognition that fuses multiple sources of information about Chinese characters. To improve accuracy, the model uses the last-layer output of the pre-trained BERT model as the original word embedding and concatenates it with vectors for information such as paraphernalia and word frequencies of the corpus text, which provides sufficient prior knowledge. Lexical information is further fused while features are extracted in the encoding layer, and final decoding is performed by conditional random fields. To verify that the model generalizes, it is compared with common neural network models on public datasets, where it performs well. To demonstrate its effectiveness in the cybersecurity domain, it is compared with common models on the constructed cybersecurity dataset, where it achieves precision, recall, and F1 values of 0.8649, 0.8402, and 0.8523, respectively.

(3) Designing and implementing a network security entity recognition system. The system is built on the proposed named entity recognition neural network model that integrates multi-source information about Chinese characters. It is simple and practical, separates the front end from the back end, and is implemented in Python, HTML, and other languages. It can significantly improve the efficiency and accuracy of entity recognition in the field of network security.
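
To make the BIO annotation scheme and the 6:2:2 split mentioned in the abstract concrete, here is a minimal Python sketch. It is an illustration only: the label names (`SOFTWARE`, `VULN`), the English example sentence, and the helper names `spans_to_bio` and `split_622` are hypothetical and do not come from the thesis, whose corpus is Chinese and whose actual tag set and splitting procedure are not given in the abstract.

```python
import random


def spans_to_bio(tokens, spans):
    """Convert entity spans to BIO tags.

    spans is a list of (start, end, label) tuples indexing into tokens,
    with end exclusive. The first token of an entity gets B-<label>,
    the remaining tokens get I-<label>, everything else gets O.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def split_622(samples, seed=0):
    """Shuffle samples and split them into train/validation/test at 6:2:2.

    A fixed seed keeps the split reproducible; any remainder from the
    integer division ends up in the test set.
    """
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 6 // 10
    n_val = len(items) * 2 // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


# Hypothetical example sentence and labels (not taken from the thesis corpus).
tokens = ["Apache", "Tomcat", "has", "a", "buffer", "overflow", "vulnerability"]
spans = [(0, 2, "SOFTWARE"), (4, 6, "VULN")]

for token, tag in zip(tokens, spans_to_bio(tokens, spans)):
    print(f"{token}\t{tag}")
```

For a Chinese corpus the same conversion would typically run over characters rather than whitespace-separated words, so each character of an entity carries its own B- or I- tag.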
Keywords/Search Tags: Named entity recognition, cybersecurity, corpus construction, pre-trained model, word vector fusion