Research On Sensitive Document Recognition Based On Text Content

Posted on: 2021-04-24
Degree: Master
Type: Thesis
Country: China
Candidate: Q N Shen
Full Text: PDF
GTID: 2428330611997417
Subject: Engineering
Abstract/Summary:
As more information is stored and exchanged electronically, leaks of sensitive information are on the rise, and the resulting losses and impact are growing. Text documents are an important carrier for transmitting and storing information, and they are involved in more and more security incidents. In a big data environment, identifying sensitive information in large volumes of text documents, so that leakage prevention work can follow, has become an important problem in the security field in recent years.

Traditional sensitive document detection is usually based on keyword matching or statistical text features. These two methods share two limitations. First, they rely on manually built keyword dictionaries and manual annotation, which require a large amount of manual work. Second, they ignore word order and context and fail to fully capture the intrinsic meaning of the text, so their performance is not good enough for the complexity of sensitive document detection. With the rapid development of natural language processing theory and technology, some researchers have applied deep learning to recognize sensitive documents through text classification. This approach depends to a great extent on how well the model represents sensitive text content. Because of the special nature of sensitive documents, the available training samples are too few for the model to learn high-quality word vector representations. Since words are the basic unit of text, word vector quality has a significant impact on the representation of text content. In addition, the sensitivity of a word is closely related to its context. For example, the term 'force deployment' is highly sensitive in military documents but less sensitive in news and popular books. This paper therefore studies sensitive document recognition from the perspective of text content representation, as follows:

1. To resolve the conflict between insufficient training samples and the semantic expression ability the model requires, this paper introduces pre-trained word vectors to enrich the linguistic knowledge of the model and proposes an improved ELMo dynamic word vector generation model. A sample set is then constructed to simulate sensitive document recognition scenarios by changing the context, adding noise, and introducing unknown words, and the improved model is verified on it. The experimental results show that context-sensitive word vectors have a significant advantage over static word vectors in these scenarios, and that the improved model outperforms the original model in semantic expression and training speed. These results confirm that the ability to represent text content has a positive effect on detection and recognition, and they also verify the effectiveness of the improved ELMo.

2. To avoid the manual annotation and feature selection required by traditional machine learning recognition methods, and considering the advantages of deep learning for complex problems such as data imbalance and generalization, this paper combines a CNN with an attention mechanism in parallel with a bidirectional RNN and proposes the BGCBA (Bi-GRU-CNN Based on Attention) model to extract text features, mining as many semantic features as possible from the limited sensitive samples. Experiments show that the BGCBA model achieves better classification performance than a single neural network.

3. Based on the above methods, this paper proposes a sensitive document recognition model based on pre-trained word vectors, and designs and implements a sensitive document recognition system. Tests of the module functions and system performance verify the effectiveness and practicality of the model.
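The context problem that the abstract attributes to keyword matching can be illustrated with a minimal sketch. This is not the thesis's code: the dictionary contents, the weights, the threshold, and the function names `keyword_score` and `is_sensitive` are all hypothetical, chosen only to show why a context-free matcher treats a military document and a passage about a novel the same way.

```python
import re

# Hypothetical keyword dictionary with weights; the thesis does not
# publish its actual terms, so these entries are illustrative only.
SENSITIVE_KEYWORDS = {"force deployment": 1.0, "troop movement": 0.8}


def keyword_score(text, keywords=SENSITIVE_KEYWORDS):
    """Sum the weights of dictionary terms found in the text.

    Matching is done on lowercased, whitespace-normalized text and
    ignores word order and surrounding context entirely.
    """
    normalized = re.sub(r"\s+", " ", text.lower())
    return sum(weight for term, weight in keywords.items() if term in normalized)


def is_sensitive(text, threshold=1.0):
    """Flag a document when its keyword score reaches the (assumed) threshold."""
    return keyword_score(text) >= threshold


military = "Annex C: force deployment schedule for the northern sector."
news = "The novel opens with a vivid account of force deployment in 1944."

print(is_sensitive(military))  # True
print(is_sensitive(news))      # True: same verdict, context is ignored
```

Both texts receive the same verdict because the matcher sees only the surface string. This is the gap that context-dependent word vectors, such as those produced by an ELMo-style model, are meant to close.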
Keywords/Search Tags:sensitive information, text classification, word vector, convolutional neural network, data leakage prevention