The Information Leakage Detection Based On Text Information Extraction

Posted on: 2016-01-07, Degree: Master, Type: Thesis
Country: China, Candidate: H Y Xie, Full Text: PDF
GTID: 2308330464969013, Subject: Computer technology
Abstract/Summary:
With the rapid development of information technology, people have entered the information age. In the digital environment, electronic documents and email are used ever more widely, so sensitive information is more likely to leak and protecting it is becoming increasingly important. Sensitive information leakage can be divided into single-document leakage and multi-document associated leakage. Detecting single-document leakage is relatively simple, but detecting multi-document associated leakage still depends mainly on manual inspection, which is inefficient and risky. Based on this analysis, this paper proposes a detection method based on text information extraction that determines whether a single document or a set of documents leaks sensitive information.
With email bodies, email attachments, papers, and documents as test data, this paper uses information extraction to obtain entity information and then detects sensitive information in the document data through the association of entities. Because each type of document has its own characteristics, this paper adopts a different extraction method for each type. For the email body, regular expressions are used to extract entity information. For papers and documents, the formatted files are parsed first. Second, because papers and documents follow a regular format, the format information is extracted from the parsed files and the unstructured data is transformed into structured data. Finally, entity information is extracted from the structured data.
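The regular-expression step for email bodies could be sketched roughly as follows. This is only an illustration: the abstract does not say which entity types or patterns the thesis uses, so the two patterns here (email addresses and phone numbers written as 3-4-4 digit groups) and the function name `extract_entities` are assumptions.

```python
import re

# Assumed patterns for illustration; the thesis does not list its actual ones.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}-\d{4}-\d{4}\b")


def extract_entities(body):
    """Return the distinct entities found in an email body, grouped by type."""
    return {
        "email": sorted(set(EMAIL_RE.findall(body))),
        "phone": sorted(set(PHONE_RE.findall(body))),
    }


if __name__ == "__main__":
    sample = "Contact li@example.com or 138-0000-1234; cc li@example.com"
    print(extract_entities(sample))
```

Returning the entities grouped by type keeps the output structured, so a later association step can compare entities of the same kind across documents.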
According to the classification of entity information, this paper divides the association of entity information into three kinds: the association between email body entities and email attachment entities, the association between email body entities and paper entities, and the association among document entities. Through these multiple associations of entity information, the method can detect whether sensitive information has leaked. Finally, based on system analysis, this paper determines the system framework and implements the information leakage detection system. The whole system extracts format information and entity information from the test data and ultimately detects whether sensitive information has leaked through the association of entity information.
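One way to picture detection by entity association is sketched below. The abstract does not describe the actual association rules, so this sketch assumes a hypothetical rule format: each rule is a set of entities that is sensitive when all of them appear together. The function name `find_leaks` and the sample entities are likewise invented for illustration.

```python
def find_leaks(doc_entities, sensitive_combos):
    """Flag sensitive entity combinations covered by one or several documents.

    doc_entities: dict mapping a document id to its set of extracted entities.
    sensitive_combos: iterable of entity sets (assumed rule format).
    """
    merged = set().union(*doc_entities.values())
    findings = []
    for combo in sensitive_combos:
        combo = set(combo)
        if not combo <= merged:
            continue  # the combination is not fully present anywhere
        # Documents that contain the whole combination on their own.
        single = sorted(d for d, ents in doc_entities.items() if combo <= ents)
        if single:
            findings.append(("single-document", single))
        else:
            # Only the association of several documents reveals the combination.
            parts = sorted(d for d, ents in doc_entities.items() if ents & combo)
            findings.append(("multi-document", parts))
    return findings


if __name__ == "__main__":
    docs = {
        "body": {"Project X", "Li Wei"},
        "attachment": {"budget 2015"},
    }
    print(find_leaks(docs, [{"Project X", "budget 2015"}]))
```

In this sketch neither the email body nor the attachment contains the full combination, so the finding is reported as multi-document, which is the case the abstract describes as hard to detect by hand.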
Keywords/Search Tags: Information Extraction, Association of Entity Information, Format Information, Entity Information, Information Disclosure