
Word Document Parsing And Content Desensitization Techniques

Posted on: 2019-04-27
Degree: Master
Type: Thesis
Country: China
Candidate: Y T Liao
GTID: 2348330563954542
Subject: Information and Communication Engineering
Abstract/Summary:
In recent years, the spread of networks has allowed any user to access almost any type of information easily. However, the uncontrolled proliferation of information in cyberspace has become one of the most serious challenges to the further development of the Internet. Most existing network content prevention and control technologies are based on capturing Web page content. At present, few technologies support content desensitization for files transferred through the network (such as Word documents). In addition, most current search engine and network information mining technologies are assumed to operate only on the content platform and service platform, where they carry out data collection, desensitization and processing. Few techniques can capture and analyze content flows in the network itself. Given how widely Word documents are disseminated over the network, there is an emerging need to control the spread of user-specified sensitive information in Word documents in transit.

This thesis studies the parsing of Word documents transferred over the network and the desensitization of their content. Firstly, the thesis analyzes the text content of Word documents, which exist in several versions. Based on a study of Word file parsing methods for both the DOC and DOCX file formats, it presents a detailed operation flow chart for Word document content parsing. Secondly, motivated by the classical string matching algorithms, two improved BMHS pattern matching algorithms (improved algorithm model I and improved algorithm model II) are developed. The analysis shows that, compared with the BMHS algorithm, the improved BMHS algorithms effectively reduce the number of matching operations, improve matching efficiency, and meet the need for fast matching of specified words in the Word text content desensitization application. To meet the need for desensitization of similar keywords, the Word2vec algorithm is also studied. The results show that Word2vec can effectively handle the desensitization of similar keywords in Word documents.

Finally, using a reverse proxy mechanism, the thesis presents a solution for Word document parsing and content desensitization built on a framework of network Word file recognition, sensitive content identification and sensitive content desensitization. The TCP reverse proxy module, the connection management and log system module, the content analysis module and the interface system module are introduced in detail. Functional tests and stress tests show that the proposed Word text content parsing and desensitization solution meets practical Word text content desensitization requirements. The analysis in this thesis offers a reference for further study of effective content control technology for electronic documents on the network.
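The baseline string matching step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that "BMHS" refers to the Sunday-style variant of Boyer-Moore-Horspool, in which the shift is chosen from the character just past the current window; this is an assumption about terminology, and the sketch is not either of the thesis's improved models. The function names `build_shift_table`, `bmhs_find_all` and `desensitize`, and the `*` mask character, are hypothetical and chosen only for illustration.

```python
def build_shift_table(pattern):
    """Map each pattern character to its shift distance.

    For a character c seen just past the window, the shift aligns the
    last occurrence of c in the pattern with that character.
    """
    m = len(pattern)
    return {ch: m - i for i, ch in enumerate(pattern)}


def bmhs_find_all(text, pattern):
    """Return start indices of all (possibly overlapping) matches."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    shift = build_shift_table(pattern)
    hits = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        nxt = pos + m  # index of the character just past the window
        if nxt >= n:
            break
        # Characters absent from the pattern allow a full skip of m + 1.
        pos += shift.get(text[nxt], m + 1)
    return hits


def desensitize(text, keyword, mask="*"):
    """Replace every occurrence of keyword with mask characters (assumed policy)."""
    chars = list(text)
    for start in bmhs_find_all(text, keyword):
        for i in range(start, start + len(keyword)):
            chars[i] = mask
    return "".join(chars)


if __name__ == "__main__":
    print(desensitize("the secret plan is secret", "secret"))
```

Masking with a fixed character is only one possible desensitization policy; a real system would apply it to the text runs extracted from the DOC or DOCX file rather than to a plain string.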
Keywords/Search Tags: Content prevention and control, Content Resolution, Content desensitization, Word document, BMHS algorithm