
PDF Document Parsing And Content Desensitization Techniques

Posted on: 2019-03-26
Degree: Master
Type: Thesis
Country: China
Candidate: L Y Zhu
Full Text: PDF
GTID: 2348330569988907
Subject: Information and Communication Engineering
Abstract/Summary:
With the rapid development of the Internet and the dawn of the big data era, access to information resources has become far more convenient, but this also brings problems such as the unwanted release and dissemination of sensitive information. How to prevent the unwanted disclosure of confidential information while keeping data resources open and shareable has become a key issue that must be addressed carefully. PDF is a widely used electronic document format, so research on the prevention and control of PDF document content has potential applications. Most existing PDF parsing tools handle only local (offline) PDF files, and few can deal with online PDF documents. In addition, few tools exist for desensitizing confidential information in PDF documents. Given the widespread dissemination of PDF documents in future networks, there is an urgent need to prevent and control the proliferation of user-specified sensitive information in PDF document content.

This thesis focuses on online PDF document parsing and the related content desensitization techniques. First, based on a brief survey of the challenges facing current network information security and the available technical solutions, the thesis demonstrates the necessity of desensitizing sensitive information in electronic documents. Then, based on an analysis of current PDF parsing technology and its problems, the thesis proposes a PDF file parsing framework that is suitable for local PDF files and also applicable to online PDF documents on the network. Building on the parsed PDF content, and in order to efficiently locate and confirm the confidential information specified by the user, the thesis analyzes and compares two classical matching algorithms, the BM algorithm and the QS algorithm. It then proposes an efficient keyword matching algorithm that takes into account the features of PDF text content encoding. Experimental results show that the new algorithm effectively improves matching efficiency.

Finally, within the framework of a reverse proxy mechanism, the thesis presents a solution for PDF document parsing and content desensitization based on online PDF file identification, identification and location of confidential content, and the corresponding desensitization processing. Functional tests and system performance tests are also carried out. The experimental results show that online PDF file recognition, confidential content identification, and confidential content desensitization meet practical application requirements. The analysis in this thesis has reference value for further study of effective content control techniques for networked electronic documents.
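
For readers unfamiliar with the baselines named in the abstract, the following is a minimal sketch of a classical keyword matcher followed by masking. It assumes that "QS" refers to Sunday's Quick Search algorithm, which the abstract does not spell out. It is not the thesis's improved algorithm, whose details are not given here. The function names quick_search and mask_keyword, the byte-level input, and the "*" mask character are all hypothetical choices made for illustration.

```python
from __future__ import annotations


def build_shift_table(pattern: bytes) -> dict[int, int]:
    """Quick Search shift table: for each byte in the pattern, the distance
    from its last occurrence to one position past the pattern's end."""
    m = len(pattern)
    # Later occurrences overwrite earlier ones, so the rightmost wins.
    return {byte: m - i for i, byte in enumerate(pattern)}


def quick_search(text: bytes, pattern: bytes) -> list[int]:
    """Return the start offsets of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    shift = build_shift_table(pattern)
    hits = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        if pos + m >= n:
            break  # no byte follows the current window
        # Shift by looking at the byte just past the current window;
        # a byte absent from the pattern lets the window jump m + 1.
        pos += shift.get(text[pos + m], m + 1)
    return hits


def mask_keyword(text: bytes, keyword: bytes, mask: bytes = b"*") -> bytes:
    """Hypothetical desensitization step: overwrite each match with a
    single-byte mask character, keeping the text length unchanged."""
    out = bytearray(text)
    for pos in quick_search(text, keyword):
        out[pos:pos + len(keyword)] = mask * len(keyword)
    return bytes(out)


if __name__ == "__main__":
    print(quick_search(b"abracadabra", b"abra"))
    print(mask_keyword(b"call 555 now 555", b"555"))
```

The sketch works on raw bytes because text extracted from a PDF may arrive in more than one encoding. A real pipeline would first have to decode the content streams and normalize the encoding before matching, which is the part the thesis says its own algorithm takes into account.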
Keywords/Search Tags: content prevention and control, content analysis, content desensitization, PDF document