The Design and Implementation of an Anti-Spam Engine Based on Winnow

Posted on: 2007-01-31
Degree: Master
Type: Thesis
Country: China
Candidate: L Zhang
Full Text: PDF
GTID: 2178360212965549
Subject: Control theory and control engineering
Abstract/Summary:
Email has become an important means of communication and an important part of enterprise operations. As a byproduct of email, however, spam increasingly affects people's daily lives, and anti-spam has become a major global research topic. Among the various techniques for addressing spam, a commonly used method starts from the mail content and applies text categorization: a classifier is learned on a training set, and system performance is then measured on a test set. Recently the center of spam activity has begun to move to China, so research on Chinese anti-spam is important.

This dissertation focuses on Chinese spam. It discusses the techniques required in an anti-spam engine and designs the system modules, including a pre-processing module, a training module, a classification module, and a feedback module, and it gives the implementation of some of the important modules. Finally, a Chinese anti-spam engine based on the Winnow algorithm is realized. Specifically, the dissertation includes the following main parts:

1) The pre-processing module includes mail decoding and Chinese word segmentation. The mail decoding module gives a detailed description of the Base64 and QP (quoted-printable) encoding standards and decoding algorithms. To make the dictionary easier to maintain, the Chinese word segmentation module adopts an improved full binary search maximal match algorithm.

2) The training module uses the Winnow algorithm to construct the classifier. Winnow training is online and mistake-driven, and Winnow is well suited to feedback. Both positive Winnow and balanced Winnow were implemented in this system, and testing found balanced Winnow superior to positive Winnow.

3) In the classification module, the original method of setting the threshold was found to lead to recall that was too low. After the threshold was adjusted, system performance improved significantly. Finally, the method for adjusting the threshold is given.
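The online, mistake-driven training described for balanced Winnow can be illustrated with a minimal sketch. This is not the dissertation's implementation: the class name `BalancedWinnow`, the promotion and demotion factors (`alpha`, `beta`), the initial weights, and the default threshold of 0 are all assumptions chosen for illustration, and features are treated as simple present/absent tokens.

```python
from collections import defaultdict


class BalancedWinnow:
    """Illustrative balanced Winnow over binary (present/absent) features.

    All parameter values below are assumptions, not taken from the thesis.
    """

    def __init__(self, alpha=1.5, beta=0.5, threshold=0.0):
        self.alpha = alpha          # promotion factor, must be > 1
        self.beta = beta            # demotion factor, must be in (0, 1)
        self.threshold = threshold  # decision threshold (tunable)
        # Each feature carries a positive and a negative weight.
        # Both start at 1.0 here, so unseen features contribute 0.
        self.w_pos = defaultdict(lambda: 1.0)
        self.w_neg = defaultdict(lambda: 1.0)

    def score(self, features):
        # Sum of (w+ - w-) over the distinct features present in the mail.
        return sum(self.w_pos[f] - self.w_neg[f] for f in set(features))

    def predict(self, features):
        # True means "spam".
        return self.score(features) > self.threshold

    def update(self, features, is_spam):
        """Train on one labelled mail; weights change only on a mistake.

        Returns True if the classifier was wrong and weights were updated.
        """
        if self.predict(features) == is_spam:
            return False
        if is_spam:
            # Missed spam: promote w+, demote w- for active features.
            up, down = self.alpha, self.beta
        else:
            # Ham flagged as spam: demote w+, promote w-.
            up, down = self.beta, self.alpha
        for f in set(features):
            self.w_pos[f] *= up
            self.w_neg[f] *= down
        return True
```

Because `update` handles one mail at a time and only acts on mistakes, the same call could in principle serve a feedback path: a mail that a user reports as misclassified is passed back with its corrected label. Raising or lowering `threshold` trades recall against false positives, which is the kind of adjustment part 3) refers to, although the thesis's own adjustment method is not reproduced here.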
Keywords/Search Tags: Spam mail, Mail decoding, Chinese word segmentation, Feature extraction, Winnow, VSM, Feedback