
Research On Sensitive File Detection Technology Based On Image Recognition

Posted on: 2020-12-15
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Zhou
Full Text: PDF
GTID: 2428330572961786
Subject: Engineering
Abstract/Summary:
With the rapid development of digital technology, sensitive files belonging to enterprises and governments often appear on the Internet as a result of theft or leaks. The leakage of these documents often has a significant negative impact on the government or enterprise concerned, so how to discover such sensitive documents has become one of the hot topics in the information security field. Traditional sensitive document detection mostly relies on matching specific keywords (such as "secrecy", "confidentiality" and "top secret"). Chinese, however, is a language that emphasizes parataxis over form, and sentence ambiguity is very common. Because of this semantic ambiguity, keyword-based detection suffers from poor accuracy, and extending the keyword list afterwards is cumbersome. In addition, many leaked files are photographed before being transmitted over the network, in which case keyword matching fails completely. To address the common case in which sensitive files are photographed and leaked on the Internet, this thesis first designs an efficient algorithm to judge whether an image is a Chinese text image, then uses OCR to extract the text, and finally detects sensitive content with a deep learning model trained on a purpose-built text corpus. The main work of this thesis is summarized as follows:

(1) An improved Chinese text image detection algorithm based on the stroke width transform (SWT) is proposed. First, exploiting the fact that the stroke width of text is relatively constant, the edges of the image are extracted with the Canny operator. Second, pairs of edge pixels on the text edges whose stroke width directions match within a threshold are found, and the stroke width (the distance between the paired pixels) is calculated. Then, stroke width values along a stroke width path that exceed the median stroke width are replaced with the median, and an SWT image containing the stroke width of each pixel is output. Finally, character candidate regions are filtered according to the relevant rules, characters are combined into lines, and four heuristic rules designed for Chinese text images are applied to further improve the detection of Chinese text image files.

(2) A deep learning method for sensitive file detection is proposed, based on a Bi-directional Long Short-Term Memory neural network (Bi-LSTM) combined with the Hierarchical Attention Networks (HAN) mechanism. First, following the definition of sensitive documents in the relevant national confidentiality regulations, five document categories are selected: "politically sensitive", "religiously sensitive", "military sensitive", "human rights sensitive" and "non-sensitive". A corpus of sensitive documents for training is then collected, labeled and constructed. Second, according to the training characteristics of the text corpus, the constructed corpus is vectorized to meet the input format required by deep learning. Finally, a neural network model based on Bi-LSTM and HAN is trained on the collected corpus to detect sensitive files in image form.

(3) A verification and demonstration system is built using the above algorithms. The system is divided into three parts: image preprocessing, image OCR, and sensitive file detection for text images. For image preprocessing, the system provides correction functions for the tilt and perspective distortion that occur when a file is photographed, so that better recognition results can be achieved in the OCR stage. For sensitive document detection, the text extracted by OCR is tested for sensitivity with the detection model based on Bi-LSTM and HAN, meeting the system design requirements.
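The two-pass idea behind item (1), a first pass that assigns a width to every stroke pixel and a second pass that clips values above the median, can be illustrated with a small sketch. This is not the thesis's implementation: it assumes a ready-made binary text mask instead of Canny edges, and it casts rays only along rows and columns rather than along the gradient direction. The names `runs`, `collect_rays`, `first_pass`, `median_clip`, `stroke_width_transform` and `L_SHAPE` are hypothetical.

```python
from statistics import median


def runs(line):
    """Yield (start, stop) index pairs for each maximal run of truthy values."""
    start = None
    for i, value in enumerate(line):
        if value and start is None:
            start = i
        elif not value and start is not None:
            yield start, i
            start = None
    if start is not None:
        yield start, len(line)


def collect_rays(mask):
    """Collect axis-aligned rays (lists of (row, col)) across every stroke run.

    Assumption: a full SWT follows the gradient direction from each edge pixel;
    horizontal and vertical runs are used here only to keep the sketch short.
    """
    rays = []
    for r, row in enumerate(mask):
        for a, b in runs(row):
            rays.append([(r, c) for c in range(a, b)])
    for c in range(len(mask[0])):
        column = [row[c] for row in mask]
        for a, b in runs(column):
            rays.append([(r, c) for r in range(a, b)])
    return rays


def first_pass(mask, rays):
    """Give each text pixel the length of the shortest ray that crosses it."""
    swt = [[float("inf") if v else 0 for v in row] for row in mask]
    for ray in rays:
        for r, c in ray:
            swt[r][c] = min(swt[r][c], len(ray))
    return swt


def median_clip(swt, rays):
    """Replace values above a ray's median with that median (fixes corners)."""
    for ray in rays:
        med = median(swt[r][c] for r, c in ray)
        for r, c in ray:
            swt[r][c] = min(swt[r][c], med)
    return swt


def stroke_width_transform(mask):
    """Run both passes and return the per-pixel stroke width map."""
    rays = collect_rays(mask)
    return median_clip(first_pass(mask, rays), rays)


# An "L" made of two strokes, each 2 pixels thick.
L_SHAPE = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
]

if __name__ == "__main__":
    for row in stroke_width_transform(L_SHAPE):
        print(row)
```

In the corner of the "L", both the row ray and the column ray are 6 pixels long, so the first pass overestimates the width there. The median step pulls those pixels back to the stroke's typical width of 2, which is why a roughly constant stroke width can then serve as a cue for text.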
Keywords/Search Tags: Chinese sensitive file, stroke width algorithm, discrete Fourier transform, bi-directional long short-term memory, hierarchical attention mechanism
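The attention step that a hierarchical attention network places on top of recurrent hidden states (item (2) of the abstract) can be sketched as follows. This is a minimal illustration under assumptions, not the thesis's model: the hidden vectors stand in for Bi-LSTM outputs, the form follows the commonly used formulation (score each state through a tanh projection against a context vector, softmax the scores, take the weighted sum), and the name `attention_pool` and all parameter values are hypothetical.

```python
import math


def attention_pool(hidden, W, b, context):
    """Attention pooling over a sequence of hidden states.

    hidden  : list of T vectors of dimension d (assumed Bi-LSTM outputs)
    W, b    : k x d projection matrix and length-k bias
    context : length-k context vector that scores each projected state
    Returns (weights, pooled): T attention weights and one d-dimensional vector.
    """
    scores = []
    for h in hidden:
        # u = tanh(W h + b)
        u = [
            math.tanh(sum(w * x for w, x in zip(row, h)) + bias)
            for row, bias in zip(W, b)
        ]
        # score = u . context
        scores.append(sum(ui * ci for ui, ci in zip(u, context)))

    # Numerically stable softmax over the T scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the hidden states.
    dim = len(hidden[0])
    pooled = [sum(w * h[j] for w, h in zip(weights, hidden)) for j in range(dim)]
    return weights, pooled


if __name__ == "__main__":
    states = [[0.0, 0.0], [5.0, 0.0]]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    weights, pooled = attention_pool(states, identity, [0.0, 0.0], [1.0, 0.0])
    print(weights, pooled)
```

In a hierarchical setup the same pooling would be applied twice: once over the words of each sentence and once over the resulting sentence vectors, with the final document vector passed to a classifier over the five categories.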