Design And Implementation Of Internet PDF Document Privacy And Sensitivity Content Control System

Posted on: 2020-11-08
Degree: Master
Type: Thesis
Country: China
Candidate: B G Liu
Full Text: PDF
GTID: 2428330590496430
Subject: Information and Communication Engineering
Abstract/Summary:
With the rapid development of information technology, the network has reached every corner of our lives. While bringing convenience to people, the Internet has also become a medium for the rapid spread of pornographic, violent, reactionary and other sensitive texts. As a portable document format, PDF appears more and more often in network transmission. The harmful text it can carry not only pollutes the network environment but also endangers the harmony and stability of society. However, research on the prevention and control of PDF text content is not yet mature, and accurately and efficiently identifying the sensitivity of text in PDF documents is an important issue. Because of the particular way Chinese characters are encoded in PDF documents and the lack of sufficient open source resources, research on the prevention and control of sensitive content in Chinese PDF documents in online network environments is still insufficient. The prevention and control of sensitive content in PDF documents transmitted over the network therefore remains a key issue to be solved in the field of network security.

Since a sensitive content prevention and control system for PDF documents must operate in a real-time online network environment, PDF document analysis and sensitivity discrimination place high demands on processing rate and recognition accuracy. Building on the implementation of such a system, this thesis proposes the SLQP algorithm for fast positioning of PDF stream labels, the PB-WM algorithm for efficient matching of text content, and an attention-based bidirectional regional LSTM network model for target sentiment analysis, which is used to improve the recognition accuracy of the system.

Positioning stream content tags is in fact a special single-pattern matching problem in which the pattern features and data types are clear. However, common single-pattern matching algorithms cannot take advantage of these features. On the basis of this observation, this thesis designs a simple and efficient single-pattern matching algorithm, the SLQP algorithm. In comparisons carried out in actual live network environments, the SLQP algorithm is more efficient than other conventional algorithms for PDF stream positioning.

For the review of sensitive content in PDF text, and in particular the multi-pattern matching problem of matching multiple sensitive words simultaneously, this thesis studies and implements the PB-WM algorithm, an efficient multi-pattern matching algorithm for Chinese PDF text based on the Chinese PDF text encoding rules. Experimental comparison shows that, on the text content matching problem for Chinese PDF, the PB-WM algorithm achieves higher matching efficiency than other multi-pattern matching algorithms.

To make the system's discrimination of PDF documents more accurate, this thesis proposes a dual judging scheme to discern the emotional polarity of sensitive words. Specifically, an attention-based bidirectional regional LSTM network model is constructed to perform target sentiment analysis on sensitive words, so that their meaning in context can be discerned accurately. In addition, to expand the range of sensitive words, the system identifies synonyms of sensitive words based on Word2vec, which makes the system more complete.

Finally, the thesis uses a reverse proxy mechanism as the framework to intercept TCP traffic and extract PDF documents for content sensitivity discrimination, thus realizing the sensitive content prevention and control system for network PDF documents. Experiments show that the system can meet the requirements of real-time sensitivity discrimination of PDF documents in online networks. The analysis in this thesis can provide useful insights for future research on effective PDF content prevention and control technologies.
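
As an illustration of the stream positioning problem mentioned in the abstract, the following is a minimal baseline sketch. It is not the SLQP algorithm, whose details are not given here; it simply uses Python's built-in `bytes.find` to locate the `stream` and `endstream` keywords that delimit stream data in a PDF file. The function name `find_stream_spans` and the sample bytes are hypothetical, and the sketch assumes well-formed input and does not decode compressed streams.

```python
from typing import List, Tuple

STREAM = b"stream"
ENDSTREAM = b"endstream"


def find_stream_spans(data: bytes) -> List[Tuple[int, int]]:
    """Return (start, end) byte offsets of each stream body in a PDF buffer.

    Baseline only: a plain substring search, not the thesis's SLQP algorithm.
    """
    spans = []
    pos = 0
    while True:
        keyword = data.find(STREAM, pos)
        if keyword == -1:
            break
        body = keyword + len(STREAM)
        # The keyword is followed by an end-of-line marker (CRLF or LF)
        # before the stream data begins.
        if data[body:body + 2] == b"\r\n":
            body += 2
        elif data[body:body + 1] == b"\n":
            body += 1
        end = data.find(ENDSTREAM, body)
        if end == -1:
            break  # unterminated stream: stop rather than guess
        spans.append((body, end))
        # Resume after "endstream" so its trailing "stream" is not
        # mistaken for the start of a new stream.
        pos = end + len(ENDSTREAM)
    return spans


# Hypothetical sample: two tiny uncompressed streams.
sample = (
    b"1 0 obj\n<< /Length 5 >>\nstream\nhello\nendstream\nendobj\n"
    b"2 0 obj\n<< /Length 3 >>\nstream\r\nabc\r\nendstream\nendobj\n"
)
bodies = [sample[start:end] for start, end in find_stream_spans(sample)]
```

Each returned span covers the raw bytes between the two keywords, including the end-of-line marker that precedes `endstream`. A production parser would rely on the `/Length` entry of the stream dictionary instead of searching for `endstream`, since binary stream data may itself contain that byte sequence.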
Keywords/Search Tags: Privacy and sensitivity control, PDF document, Content extraction, Pattern matching, Target sentiment analysis