Research On Chinese Spam Filtering Technology

Posted on: 2012-02-06
Degree: Master
Type: Thesis
Country: China
Candidate: Q Zhang
Full Text: PDF
GTID: 2178330335452391
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet and the growing number of Internet users, e-mail has become one of the most common and economical Internet services in modern society. Because it is efficient and inexpensive, many large enterprises use it as their main internal communication channel. Its drawback is that some people exploit these same properties to spread spam for their own benefit. Spam not only occupies bandwidth, consumes network resources, and wastes users' time, but also threatens users' computers and exposes their privacy. Anti-spam technology emerged in response, and researchers in China and abroad have been looking for effective spam filtering techniques ever since. Building on earlier studies, this thesis examines the current mainstream anti-spam technologies at home and abroad in depth and focuses on content-based spam filtering. After comparing several commonly used content-based filtering methods, we adopt a Bayesian classification algorithm and improve it. First, we present a weight-based Bayesian classification model. An important feature of the model is that it explicitly incorporates the information gain of the text into the traditional weight calculation, optimizing the weight formula. Second, with practical application in mind, the thesis changes the spam decision from a comparison of the two class probabilities to a test on their quotient. Third, we consider the proportion of legitimate e-mails to junk e-mails in the training sample set. Based on the recently released "China Anti-Spam Survey Report", which gives statistics on the proportion of spam among all e-mails that users receive, we try to reproduce the real proportion in the training sample set.
Experiments show that the improved Bayesian algorithm is more accurate than the traditional algorithm. To lay the foundation for the design of the classification module, we also study several technologies related to e-mail filtering, such as Chinese word segmentation, text representation models, and feature selection. Finally, we design a complete multi-layer spam filtering system. The system combines several spam filtering techniques, including blacklist and whitelist filtering, rule-based filtering, and Bayesian filtering, and uses the improved algorithm to implement the Bayesian classification module.
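The probability-quotient decision described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation. It assumes e-mails are already segmented into tokens, uses a plain multinomial naive Bayes model with add-one smoothing, and takes per-term weights as a hypothetical input (the abstract says the weights involve information gain but does not give the formula). The function names, the toy data, and the `threshold` parameter are illustrative assumptions.

```python
import math
from collections import Counter


def train(docs):
    """Count token frequencies per class.

    docs: list of (tokens, label) pairs, label is "spam" or "ham".
    Tokens are assumed to come from a word segmenter run beforehand.
    """
    counts = {"spam": Counter(), "ham": Counter()}
    n_docs = Counter()
    for tokens, label in docs:
        counts[label].update(tokens)
        n_docs[label] += 1
    vocab = set(counts["spam"]) | set(counts["ham"])
    return counts, n_docs, vocab


def log_quotient(tokens, model, weights=None):
    """Return log(P(spam | tokens) / P(ham | tokens)) under naive Bayes.

    weights: optional dict of per-term weights (hypothetical stand-in for
    the thesis's information-gain-based weights); missing terms get 1.0.
    """
    counts, n_docs, vocab = model
    total = {c: sum(counts[c].values()) for c in counts}
    score = math.log(n_docs["spam"] / n_docs["ham"])  # log prior ratio
    for t in tokens:
        if t not in vocab:
            continue  # ignore terms never seen in training
        w = 1.0 if weights is None else weights.get(t, 1.0)
        # Add-one (Laplace) smoothed term likelihoods
        p_spam = (counts["spam"][t] + 1) / (total["spam"] + len(vocab))
        p_ham = (counts["ham"][t] + 1) / (total["ham"] + len(vocab))
        score += w * math.log(p_spam / p_ham)
    return score


def is_spam(tokens, model, threshold=1.0, weights=None):
    """Flag as spam only if the probability quotient exceeds `threshold`.

    threshold=1.0 reduces to comparing the two probabilities directly;
    a larger value demands stronger evidence before rejecting a message.
    """
    return log_quotient(tokens, model, weights) > math.log(threshold)


# Toy training set (illustrative only)
docs = [
    (["free", "prize", "win"], "spam"),
    (["free", "offer"], "spam"),
    (["meeting", "report"], "ham"),
    (["project", "meeting", "schedule"], "ham"),
]
model = train(docs)
```

With a threshold of 1 the test is the ordinary comparison of the two class probabilities. Raising the threshold makes the filter more conservative, which is one plausible reason to prefer a quotient in practice, since misclassifying a legitimate e-mail is usually costlier than letting a spam message through.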
Keywords/Search Tags: Chinese e-mail, E-mail filtering, Bayesian classification, Chinese word segmentation, weight