Research On Sensitive Words Detection Based On Hadoop In Network Security Audit

Posted on: 2016-08-26
Degree: Master
Type: Thesis
Country: China
Candidate: S He
Full Text: PDF
GTID: 2298330452466428
Subject: Software engineering
Abstract/Summary:
With the spread of the Internet, the information resources available on the network have become more abundant. At the same time, the network is increasingly filled with illegal, harmful, and sensitive information, and it has become the main medium for disseminating feudal superstition, pornography, violence, reactionary speech, rumors, myths, and so on. Faced with these threats to network security, security audit, with its real-time, dynamic, and active-defense characteristics, provides good protection for the network.

This paper is based on a practical network security audit system of a company and mainly studies sensitive-word detection technology in content audit. First, it introduces the concepts and research status of sensitive-word detection and network security audit, as well as the techniques related to the project. Based on an analysis of the system's functions, the overall implementation model of the system is given. The project's log data is stored in XML format, which contains both semantic and structural information. Combining the double-array trie and Dewey coding, the paper focuses on sensitive-word detection in XML documents. It also proposes the concept of sensitivity and gives a method for calculating it. Building on these results, a prototype sensitive-word detection system is designed and implemented, which verifies the effectiveness of the methods and techniques studied. The main work of this paper covers the following aspects.

Analyze the functional requirements of the network information security audit system and design the overall implementation model. In the context of content audit, the paper analyzes the log-based audit process, gives the log format, and identifies the object of the sensitive-word detection research.

The data on which sensitive-word detection is performed is log data in XML format.
To obtain structural information and support sensitive-word detection over complex structures, the paper studies an encoding scheme for XML documents based on Dewey coding. In Dewey coding, the code of a parent node is the prefix of the code of each of its child nodes, so the level of each node and the structural relationships between nodes are easy to obtain. This helps in calculating the structural sensitivity of the log.

To improve the efficiency of sensitive-word detection, an index should be built over the sensitive-word thesaurus. The paper builds the index with a double-array trie and studies a sensitive-word detection algorithm based on semantic and structural information. On the one hand, semantic sensitivity can be calculated from the weights of the nodes and the frequency of the sensitive words; the calculation formula is given in the paper. On the other hand, when the sensitive words carry structural information, detection needs to combine semantic and structural information: semantic matching is performed first, and structural similarity is then matched by calculating the distance between the sensitive words. Detection is then complete and the final sensitivity can be measured.

Using the detection techniques studied above, the paper designs and implements a prototype sensitive-word detection system within the network security audit system. The system is divided into four subsystems: user interface, information preparation, detection engine, and audit strategy. The overall architecture of the system is designed, and the interaction between the user and the system is analyzed. On this basis, the paper describes the design and implementation of each subsystem in detail.
The Dewey coding algorithm and the detection algorithm based on the double-array trie index are decomposed appropriately and run in an experimental Hadoop cluster environment, which improves the scalability of the system to a certain degree.
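
As an illustration of the Dewey coding and index-based matching described in the abstract, the following is a minimal Python sketch, not the thesis's implementation. It assumes a hypothetical log layout (`log`, `session`, `user`, and `message` elements), uses a plain dictionary-based trie as a stand-in for the double-array trie, and takes the number of tree edges between two nodes as one possible reading of the "distance between the sensitive words". All function names are invented for the example, and the sensitivity formula itself is not reproduced because the abstract does not state it.

```python
import xml.etree.ElementTree as ET

# Hypothetical log document; the real log schema is not given in the abstract.
LOG = """<log>
  <session>
    <user>alice</user>
    <message>this text spreads a rumor</message>
  </session>
  <session>
    <user>bob</user>
    <message>violence and rumor again</message>
  </session>
</log>"""


def dewey_codes(elem, code=(1,)):
    """Yield (code, element) in document order.

    A child's code is its parent's code plus the child's 1-based position,
    so the parent's code is always a prefix of the child's code.
    """
    yield code, elem
    for i, child in enumerate(elem, start=1):
        yield from dewey_codes(child, code + (i,))


def tree_distance(a, b):
    """Number of edges between two nodes, computed from Dewey codes alone."""
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)


def build_trie(words):
    """Plain nested-dict trie (a stand-in for a double-array trie)."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = word  # end-of-word marker
    return trie


def find_words(trie, text):
    """Return every indexed word that occurs in text, scanning each start position."""
    hits = []
    for start in range(len(text)):
        node = trie
        for ch in text[start:]:
            node = node.get(ch)
            if node is None:
                break
            if None in node:
                hits.append(node[None])
    return hits


def detect(xml_text, words):
    """Return (word, dewey_code) for each sensitive word found in element text."""
    trie = build_trie(words)
    root = ET.fromstring(xml_text)
    hits = []
    for code, elem in dewey_codes(root):
        for word in find_words(trie, elem.text or ""):
            hits.append((word, code))
    return hits


for word, code in detect(LOG, ["rumor", "violence"]):
    print(word, ".".join(map(str, code)), "depth", len(code))
```

In this sketch the level of a node is simply the length of its code, and the relationship between two hits can be read off their codes without revisiting the document, which is the property of Dewey coding that the abstract relies on.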
Keywords/Search Tags: content audit, XML, double-array trie, sensitive words, Hadoop