
Design And Implementation Of Electronic Document Sensitive Information Detection System Based On Content Similarity

Posted on: 2021-10-07    Degree: Master    Type: Thesis
Country: China    Candidate: F Hou
GTID: 2568306104463974    Subject: Electronic and communication engineering
Abstract/Summary:
As enterprises have become highly informatized, core confidential data is now spread across the computers on their intranets. Because tools for detecting sensitive information in electronic documents are lacking, documents containing confidential content are often leaked, which can cause irreparable losses. To identify electronic documents that contain sensitive information and prevent such leaks, this thesis designs and develops an electronic document sensitive information detection system based on the similarity of text content, going beyond traditional detection strategies.

Firstly, a document detection strategy based on file fingerprint algorithms is proposed. To address the shortcomings of the traditional Simhash fingerprint algorithm, three new fingerprint algorithms are proposed by improving the feature extraction method: the KbS (Keywords based Simhash) fingerprint algorithm, the PbS (Paragraphs based Simhash) fingerprint algorithm and the SoP (Simhash of Paragraphs) fingerprint algorithm. The thesis analyses the advantages of these three algorithms in detecting different kinds of sensitive documents. On this basis, it explores how different degrees of content modification of secret documents affect the Hamming distance between their digital fingerprints. It then verifies that the fingerprint strategy can identify sensitive information in secret documents whose content has been changed, which provides a basis for setting the sensitivity threshold of the detection strategy.

Secondly, a document detection strategy based on a semantic VSM algorithm is proposed. To address the shortcomings of the traditional vector space model, a similarity calculation method based on word semantics is studied. This method improves the traditional VSM by adding semantic concepts, establishes the HowNet VSM similarity calculation method, and verifies the advantages of the improved algorithm in content similarity calculation through clustering experiments. On this basis, the thesis explores and analyses the numerical similarity relationship between process documents and the original confidential document, which provides a basis for setting the sensitivity threshold of the semantic VSM detection strategy.

Finally, the thesis develops the electronic document sensitive information detection system and tests its functionality and performance. The overall structure of the system, the flow of the main functional modules and the structure of the database are designed. On this basis, the front-end and back-end code of the system is implemented, and test experiments verify that the system detects sensitive information in electronic documents with high precision. It can also identify process documents and sensitive documents whose content has been changed. In addition, the running time of the system is analysed to verify its detection performance.
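The fingerprint strategy summarized above compares documents by the Hamming distance between their Simhash fingerprints. The following is a minimal sketch of the traditional Simhash baseline only, not the KbS, PbS or SoP variants, whose feature extraction the abstract does not detail. The regex tokenizer, the use of MD5 as the per-token hash, the 64-bit fingerprint width, the function names and the threshold of 3 are all assumptions made for illustration.

```python
import hashlib
import re
from collections import Counter

BITS = 64  # assumed fingerprint width


def tokenize(text):
    # Naive lowercase word tokens (assumption); a real system would
    # use proper word segmentation and keyword weighting.
    return re.findall(r"\w+", text.lower())


def simhash(text):
    """Classic Simhash: each token votes on every bit, weighted by its frequency."""
    votes = [0] * BITS
    for token, weight in Counter(tokenize(text)).items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")  # first 64 bits of MD5
        for i in range(BITS):
            votes[i] += weight if (token_hash >> i) & 1 else -weight
    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    # Number of bit positions in which the two fingerprints differ.
    return bin(a ^ b).count("1")


def is_sensitive(candidate, secret, threshold=3):
    # threshold is a hypothetical value; the thesis derives its own
    # thresholds experimentally.
    return hamming_distance(simhash(candidate), simhash(secret)) <= threshold
```

A candidate document would be flagged when `is_sensitive(candidate_text, secret_text)` returns `True`. A lower threshold makes detection stricter, while a higher one tolerates more content modification at the cost of more false positives.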
Keywords/Search Tags: sensitive information, secret documents, content similarity, document fingerprint, detection strategy