Research on Content Security Blocking Systems

Posted on: 2008-01-11
Degree: Master
Type: Thesis
Country: China
Candidate: H M Liu
Full Text: PDF
GTID: 2178360242977057
Subject: Electronics and Communications Engineering
Abstract/Summary:
With the flood of information on the Internet, harmful content such as reactionary Web pages, erotic Web pages and hate messages has appeared. These pages carry malicious commentary on our nation that is inconsistent with the facts, and they have a bad influence on our people and on our youth, who are at an important stage of studying philosophy and scientific knowledge. Erotic Web pages likewise have a harmful effect on young people. Blocking these pages is therefore necessary, and it is important to research models and techniques that Internet service support organizations can use to block such harmful Web documents.

Content security now confronts researchers. Drawing on techniques from machine learning, pattern recognition, data mining, natural language understanding, Chinese information processing, rough set theory and artificial intelligence, this thesis proposes efficient models and techniques for blocking these harmful Web pages. The author's main work is as follows.

Firstly, the existing text filtering techniques and systems in our nation are analyzed, and four common methods are presented: Platform for Internet Content Selection (PICS), URL blocking, keyword filtering, and intelligent content analysis. Of these, in-depth intelligent content analysis is essential for filtering harmful pages. As the first step of Web page processing, an algorithm for extracting the main text of a Web page is put forward, and a fast word-statistics algorithm is also proposed.

Secondly, text representation techniques, term weighting and weight normalization, and the common feature selection techniques are studied. Different statistical methods need different text representations: VSM and SVM use the vector space model to represent text, while the naïve Bayes model uses word probabilities.
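As a rough illustration of the vector space representation and weight normalization mentioned above, the following Python sketch turns tokenized documents into length-normalized term-weight vectors and compares them by cosine similarity. The TF-IDF weighting variant and the function names `tfidf_vectors` and `cosine` are assumptions made for this example; the abstract does not state which weighting scheme the thesis uses.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalized TF-IDF vector (dict: term -> weight) per document.

    `docs` is a list of token lists. The idf form log(N / df) + 1 is one
    common variant, chosen here only for illustration.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: tf[t] * (math.log(n_docs / df[t]) + 1.0) for t in tf}
        # Weight normalization: scale the vector to unit Euclidean length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


if __name__ == "__main__":
    docs = [
        ["spam", "offer", "offer"],
        ["meeting", "agenda"],
        ["spam", "offer"],
    ]
    vecs = tfidf_vectors(docs)
    print(cosine(vecs[0], vecs[2]), cosine(vecs[0], vecs[1]))
```

Normalizing each vector to unit length keeps long documents from dominating the similarity score purely because they contain more words.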
The experimental results suggest that the three methods perform well on balanced data sets, but on unbalanced data sets the naïve Bayes model performs worse than VSM and SVM, especially on unbalanced, unseen test data. Since real applications often involve unbalanced data, this finding is of practical use. Weight normalization is also very effective in improving precision, especially on unbalanced data sets.

Thirdly, the concepts of rough set theory are discussed and its essence is summarized. Rough set attribute reduct algorithms based on the discernibility function and on attribute dependency are compared experimentally; the results suggest that the discernibility-function-based algorithm is harder to run than the attribute-dependency-based one because of the memory and time it requires. A hybrid method for selecting features more accurately is proposed: a conventional feature selection method makes a preliminary selection, and rough set attribute reduction then refines it. In this way many noisy and redundant attributes are deleted, leaving fewer, more accurate features. Finally, the naïve Bayes model is used to evaluate this feature selection method; the results show that it achieves high precision and high recall and is both effective and efficient.

Fourthly, feature selection is a very important step in text preprocessing: a well-chosen feature subset can match the performance of the full feature set while reducing learning time. In the filter approach, feature subset selection is performed as a preprocessing step before the induction algorithm, but this approach is ineffective at dealing with feature redundancy.
In the wrapper approach, feature subset selection is "wrapped around" an induction algorithm, so its running time can make the approach infeasible in practice, especially for text data. A new feature selection method based on rough set theory is proposed. It generates several reducts, and its distinguishing property is that the reducts share no common attributes, so these attributes have greater power to classify new objects, especially on real-world data sets. Two data sets are used to evaluate the method: a benchmark data set from the UCI machine learning archive, and a data set captured from the Web. The objects are classified with statistical classification methods. On the benchmark test set, good precision is obtained with a single reduct, whereas on the real data set good precision is obtained with three reducts; the latter data set is used in our system for topic-specific text filtering. We therefore conclude that the method is very effective in application. We also conclude that, with the same selected features, VSM and SVM perform better, while naïve Bayes performs poorly.

Finally, an efficient topic-specific Web text filtering framework is proposed, which focuses on blocking topic-specific Web text content. It uses a hybrid feature selection method and a highly efficient filtering engine. In training, features are selected using the CHI statistic and rough set theory, and the filter is then constructed with the vector space model. The framework is trained on large data sets, and the results suggest that it is more effective for topic-specific text filtering. It runs on a server such as a gateway and is more efficient than a client-based system. A front-end ("prefix") email filtering system is also proposed.
This email filtering system is separate from the original Web mail server. It dynamically controls the mailing frequency of each SMTP client and checks the content of received email in the common Chinese character encodings using a DFSA algorithm. To preserve filtering accuracy, the system sends useful emails that were blocked in error back to the mail server. Lastly, a text-filtering platform is designed for the 863 Program on information security, for training and exercising personnel in the field of text content security.
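The abstract does not describe the DFSA algorithm used for content checking. Assuming the term refers to a deterministic finite-state automaton over characters, one well-known way to match many keywords in a single pass is the Aho-Corasick construction, sketched below in Python. The function names `build_automaton` and `find_keywords` and the sample keywords are hypothetical and are not taken from the thesis.

```python
from collections import deque


def build_automaton(keywords):
    """Build Aho-Corasick tables: goto transitions, failure links, outputs."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link for each state
    out = [[]]    # keywords recognized on entering each state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first pass to compute failure links level by level.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit keywords that end at the failure state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_keywords(text, automaton):
    """Scan `text` once and return (start_index, keyword) for every match."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((i - len(word) + 1, word))
    return hits


if __name__ == "__main__":
    automaton = build_automaton(["he", "she", "his", "hers"])
    print(find_keywords("ushers", automaton))
```

Because Python strings are sequences of Unicode code points, the same sketch works on Chinese text once the message has been decoded from its byte encoding (for example GB2312 or UTF-8) into a string.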
Keywords/Search Tags: content security, text filtering, feature selection, rough set, vector space model, support vector machine