Research And Implementation Of Text Filtering System Based On Network Information Audit

Posted on: 2011-10-15
Degree: Master
Type: Thesis
Country: China
Candidate: J Zhou
Full Text: PDF
GTID: 2178360305476434
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, sharing information has become convenient. However, violent, pornographic, reactionary and other illegal information also spreads on the Internet. Monitoring information on the Internet and filtering out illegal content has therefore become an important and active research topic. Building on research into network information audit techniques and text filtering, this paper designs and implements a Web text filtering system based on network information audit.

Firstly, we implement a real-time Web page filtering module based on IP Queue, which runs on the IP Control Gateway. The module captures network packets through the IP Queue mechanism provided by the Netfilter/iptables framework in Linux and monitors the HTTP packets flowing through the gateway in real time. It consists of two sub-modules: HTTP request packet filtering and HTTP response packet filtering. The request packet filtering module analyses the IP address and URL and achieves high filtering efficiency by using black and white lists. The response packet filtering module checks whether the Web page contains illegal keywords.

Secondly, this paper presents a hierarchical text filtering approach based on bigrams for the off-line filtering module. We extract illegal keywords from the training texts using a method that combines document frequency and chi-square. In the first stage, these keywords are used to pick out the texts that contain them, which are the illegal texts together with some legal texts. In the second stage, we select bigrams from those texts as features after Chinese word segmentation, stop-word removal and other preprocessing. Based on these features, we vectorize the texts and then use a support vector machine classifier to separate the classes and filter out the illegal texts.

Thirdly, this paper proposes a new method for extracting bigram features based on illegal keywords. Within an extraction window of a given size around each illegal keyword, it extracts the bigrams that contain the keyword as candidate features, scores those candidates with chi-square, and keeps a fixed number of the highest-scoring ones as the final features. This method preserves the strong class-discriminating power of bigrams while reducing data sparseness.

Finally, we implement a complete Web page filtering system in an experimental environment, in which real-time filtering and off-line analysis are closely combined. Experimental results show that the system meets the real-time requirements while maintaining high accuracy.
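The keyword-centred bigram extraction described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation. It assumes that texts are already segmented into tokens, that a "bigram" pairs the keyword with any token inside the window (kept in text order), and that chi-square is computed from a 2x2 document-count table for the illegal and legal classes. The names `keyword_bigrams`, `chi_square`, `select_features`, `window` and `top_k` are hypothetical.

```python
from collections import Counter


def keyword_bigrams(tokens, keywords, window=2):
    """Collect (left, right) pairs joining a keyword with each token
    at most `window` positions away, preserving text order."""
    found = set()
    for i, tok in enumerate(tokens):
        if tok not in keywords:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            found.add((tokens[j], tok) if j < i else (tok, tokens[j]))
    return found


def chi_square(a, b, c, d):
    """Chi-square statistic of a 2x2 table:
    a = illegal docs with the feature, b = legal docs with it,
    c = illegal docs without it,      d = legal docs without it."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def select_features(docs, labels, keywords, window=2, top_k=10):
    """Rank candidate bigrams by chi-square and return the best `top_k`.
    `labels[i]` is True when `docs[i]` is an illegal text."""
    n_illegal = sum(labels)
    n_legal = len(labels) - n_illegal
    in_illegal, in_legal = Counter(), Counter()
    for tokens, is_illegal in zip(docs, labels):
        # Count each bigram once per document (document frequency).
        target = in_illegal if is_illegal else in_legal
        target.update(keyword_bigrams(tokens, keywords, window))
    scores = {}
    for bigram in set(in_illegal) | set(in_legal):
        a, b = in_illegal[bigram], in_legal[bigram]
        scores[bigram] = chi_square(a, b, n_illegal - a, n_legal - b)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda bg: (-scores[bg], bg))[:top_k]


# Toy data with a placeholder keyword "kw" (illustrative only).
docs = [
    ["buy", "kw", "now"],
    ["buy", "kw", "cheap"],
    ["kw", "is", "banned"],
    ["weather", "is", "nice"],
]
labels = [True, True, False, False]
best = select_features(docs, labels, {"kw"}, window=1, top_k=1)
```

Because every candidate is anchored on a keyword, the candidate set is much smaller than the set of all bigrams in the corpus, which is one way to read the abstract's claim about reduced data sparseness. The selected bigrams could then serve as the dimensions of the vectors passed to a classifier such as an SVM.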
Keywords/Search Tags: Network Information Audit, Text Filtering, Real-Time Filtering, Bigram, Extraction Window