PDF Document Parsing And Content Desensitization Techniques

Posted on:2019-03-26

Degree:Master

Type:Thesis

Country:China

Candidate:L Y Zhu

Full Text:PDF

GTID:2348330569988907

Subject:Information and Communication Engineering

Abstract/Summary:

With the rapid development of the Internet and the dawn of the big data era, access to information resources has become much more convenient, but it has also brought problems such as the unwanted release and dissemination of sensitive information. How to prevent the unwanted disclosure of confidential information while keeping data resources open and shareable has become a key issue. Because PDF is a widely used electronic document format, research on the prevention and control of PDF document content has potential practical applications. Most existing PDF parsing tools can handle only local (offline) PDF files, and few can deal with online PDF documents. In addition, there are few tools for desensitizing confidential content in PDF documents. Considering the widespread dissemination of PDF documents in future networks, there is an urgent need to prevent and control the spread of user-specified sensitive information in PDF document content.

This thesis focuses on the parsing of online PDF documents and the related content desensitization techniques. First, based on a brief survey of the challenges currently facing network information security and the available technical solutions, the thesis demonstrates the necessity of desensitizing sensitive information in electronic documents. Then, based on an analysis of current PDF parsing technology and its problems, the thesis proposes a PDF file parsing framework that is suitable not only for local PDF files but also for online PDF documents on the network. Building on the parsed PDF file, and in order to efficiently locate and confirm the confidential content specified by the user, the thesis analyzes and compares two classical matching algorithms, the BM algorithm and the QS algorithm. It then proposes an efficient keyword matching algorithm that takes the text content encoding of PDF files into account. Experimental results show that the new algorithm effectively improves matching efficiency.

Finally, within the framework of a reverse proxy mechanism, the thesis presents a solution for PDF document parsing and content desensitization that consists of online PDF file identification, identification and location of confidential content, and desensitization of that content. Functional tests and system performance tests were also carried out. The experimental results show that online PDF file identification, confidential content identification, and confidential content desensitization meet practical application requirements. The analysis in this thesis has reference value for further study of effective content control techniques for electronic documents on the network.
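The abstract names the QS algorithm as one of the classical baselines but does not describe the thesis's own improved algorithm. As an illustration only, the following minimal sketch shows the classical Quick Search idea (QS is assumed here to mean Sunday's Quick Search) applied to keyword location and masking. The function names `build_shift_table`, `quick_search`, and `mask_keyword` are hypothetical, and the sketch assumes the input is already decoded text bytes; real PDF content streams are usually compressed and encoded, so they would need to be parsed and decoded first.

```python
def build_shift_table(pattern: bytes) -> dict[int, int]:
    """Quick Search shift table: for each byte in the pattern, the distance
    from its rightmost occurrence to one position past the pattern end."""
    m = len(pattern)
    # Later (rightmost) occurrences overwrite earlier ones, giving the
    # smallest safe shift for each byte value.
    return {byte: m - i for i, byte in enumerate(pattern)}


def quick_search(text: bytes, pattern: bytes) -> list[int]:
    """Return the start offsets of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    shift = build_shift_table(pattern)
    hits = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        if pos + m >= n:
            break  # no byte after the window, so nothing left to inspect
        # Shift by looking at the byte just past the current window;
        # bytes that do not occur in the pattern allow a jump of m + 1.
        pos += shift.get(text[pos + m], m + 1)
    return hits


def mask_keyword(text: bytes, pattern: bytes, mask: bytes = b"*") -> bytes:
    """Replace each occurrence of pattern with mask bytes (hypothetical
    desensitization step; with a one-byte mask the length is preserved)."""
    out = bytearray(text)
    m = len(pattern)
    for pos in quick_search(text, pattern):
        out[pos:pos + m] = mask * m
    return bytes(out)
```

For example, `mask_keyword(b"id: 12345", b"12345")` replaces the keyword with five asterisks. Keeping the masked text the same length as the original is a design choice for this sketch; whether the thesis's system preserves length is not stated in the abstract.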

Keywords/Search Tags:

content prevention and control, content analysis, content desensitization, PDF document

Related items

1	Word Document Parsing And Content Desensitization Techniques
2	Sensitive Content Prevention And Control Technology Based On Network Platform
3	Research On Content Analysis In A Methodological View
4	Design And Implementation Of Internet PDF Document Privacy And Sensitivity Content Control System
5	Research On Content Scheduling Technologies In Content Networks
6	Design And Implementation Of Document Management System Based On Content
7	Research Of The Content Processing In Gigabit Network Intrusion Prevention System
8	Research On The Adapting Content Pipeline In Content-centric Networking
9	NetEase "Easy Moment" Content Operation Research
10	Content-based Business Document Modeling Study