Research Of Text Keyword Retrieval Technology For Secrecy Inspection

Posted on: 2018-01-08 | Degree: Master | Type: Thesis
Country: China | Candidate: Z G Wang | Full Text: PDF
GTID: 2348330542490974 | Subject: Computer Science and Technology
Abstract/Summary:
Secrecy inspection is an important means of safeguarding national information security. As secrecy inspection intensifies, checking documents for confidential information has become the focus of current research on inspection tools. Growing computer storage capacity has brought massive volumes of file data, which greatly prolongs the time needed to inspect files for secret information, so traditional pattern matching algorithms struggle to meet the speed requirements of matching over massive file collections. In addition, current inspection mostly covers only the text in a document and ignores the images it contains, although these images may still carry important confidential information. As a result, current document inspection is incomplete and falls far short of the efficiency and accuracy that secrecy inspection requires.

This thesis studies text keyword retrieval technology for secrecy inspection, covering text extraction from images and multi-pattern string matching. The main work is as follows:

(1) A text extraction method based on the non-subsampled Contourlet transform is designed. The method has three steps. First, the image to be processed is decomposed with a Gaussian pyramid to obtain images at different resolutions. Second, the non-subsampled Contourlet transform is used to locate text areas, and the final text area is obtained by combining the text area positions found at each resolution. Finally, the text area is binarized with a global threshold and passed to an OCR system for recognition, which yields the extracted text file.

(2) A multi-pattern string matching algorithm based on a jump table and double hashing is designed. The algorithm has two stages, preprocessing and search, and proceeds in three steps. First, in the preprocessing stage, a character shift table is created, which is used to move the search window during pattern matching. Then, a first-level hash table and a second-level hash table are created, which are used to look up the patterns to be matched. Finally, in the search stage, the text is scanned using the shift table and the two hash tables to find all positions at which patterns occur.

Experiments on the ICDAR data set show that the proposed image text extraction method achieves a higher extraction rate and accuracy than existing typical methods. On the Reuters-21578 news data set, the proposed multi-pattern string matching algorithm shows relatively high time and space efficiency compared with existing classical algorithms. The text keyword retrieval techniques can therefore be used for secrecy inspection.
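The abstract does not give the details of the jump table and double hashing algorithm, so the following is only a minimal sketch of the general idea, written in the style of the well-known Wu-Manber algorithm and not the thesis's actual design. The function names (`build_tables`, `find_all`), the two-character block size, and the choice of keying the first-level table on the window's last block and the second-level table on its first block are all assumptions made for illustration.

```python
def build_tables(patterns, block=2):
    """Preprocessing stage: shift table plus two hash levels (assumed layout)."""
    patterns = list(dict.fromkeys(patterns))   # drop duplicates, keep order
    if not patterns or any(not p for p in patterns):
        raise ValueError("need at least one non-empty pattern")
    m = min(len(p) for p in patterns)   # search window = shortest pattern length
    b = min(block, m)                   # block size used as the hash key
    default_shift = m - b + 1           # safe jump for a block seen in no pattern
    shift = {}                          # block -> how far the window may move
    level1 = {}                         # last block -> {first block -> patterns}
    for p in patterns:
        # Only the first m characters of each pattern take part in the tables.
        for end in range(b, m + 1):
            blk = p[end - b:end]
            shift[blk] = min(shift.get(blk, default_shift), m - end)
        level1.setdefault(p[m - b:m], {}).setdefault(p[:b], []).append(p)
    return m, b, default_shift, shift, level1


def find_all(text, patterns, block=2):
    """Search stage: return (start, pattern) for every occurrence in text."""
    m, b, default_shift, shift, level1 = build_tables(patterns, block)
    hits = []
    end = m                             # exclusive end index of the window
    while end <= len(text):
        blk = text[end - b:end]
        jump = shift.get(blk, default_shift)
        if jump > 0:
            end += jump                 # no pattern can end here: skip ahead
            continue
        start = end - m
        # First level chosen by the window's last block, second by its first.
        candidates = level1.get(blk, {}).get(text[start:start + b], ())
        for p in candidates:
            if text.startswith(p, start):   # full verification
                hits.append((start, p))
        end += 1
    return hits
```

For example, `find_all("the secret plan is top secret", ["secret", "top", "plan"])` returns `[(4, 'secret'), (11, 'plan'), (19, 'top'), (23, 'secret')]`. The shift table lets the window skip over text that cannot end a pattern, and the second hash level narrows the candidate list before the full string comparison.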
Keywords/Search Tags:secrecy inspection, image text extraction, keyword retrieval, double hashing, non-subsampled Contourlet transform