Content-based Spam Filtering Technology

Posted on: 2006-12-11   Degree: Master   Type: Thesis
Country: China   Candidate: L Hu   Full Text: PDF
GTID: 2208360155965068   Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
Electronic mail (e-mail) is becoming the most important means of communication as computer and communication technology advance and society grows more dependent on information. At the same time, the problem of junk mail (also referred to as "spam") has grown more serious: most e-mail users now receive more spam than legitimate mail.

Today, many methods can be applied to the spam problem. Content-based spam filtering is one of the mainstream technologies in use. It applies automated text categorization and information filtering to identify spam, and it comprises two kinds of approach: rule-based and statistics-based.

Representative statistics-based approaches include Bayes and SVM (support vector machine). The Bayes approach is computationally simple, but its Recall and Precision are far from satisfactory. This thesis therefore chooses the SVM approach for the text categorization problem in order to obtain good Recall and Precision. The thesis introduces the principles of the e-mail system, approaches to spam filtering, public e-mail corpora, and the evaluation of spam filtering systems. It then presents an SVM-based method for filtering spam and uses active learning to reduce the excessive cost of obtaining labeled examples for machine learning.

Experiments show that, compared with a general SVM, the active learning method effectively reduces the number of examples required while preserving the accuracy of the classifier, and that, compared with the Bayes approach, the SVM approach achieves better Recall and Precision.

Finally, the thesis gives an implementation of a spam filtering system. However, much more work is needed before the filter can be used in practice.
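
The two quantities the abstract relies on, Recall and Precision, and the general idea of asking for labels only where the classifier is unsure, can be illustrated with a minimal sketch. The abstract does not say which active learning query strategy the thesis uses; the margin-based selection below (pick the unlabeled messages closest to the SVM hyperplane) is one common choice and is an assumption here. The function names, the linear weights `w` and bias `b`, and the feature vectors are all hypothetical.

```python
from typing import List, Sequence, Tuple

SPAM, HAM = 1, 0  # label encoding chosen for this sketch


def precision_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Spam precision and recall, treating spam as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == SPAM and p == SPAM)  # spam caught
    fp = sum(1 for t, p in pairs if t == HAM and p == SPAM)   # legitimate mail flagged
    fn = sum(1 for t, p in pairs if t == SPAM and p == HAM)   # spam missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def decision_value(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Signed score w.x + b of a linear classifier; the sign gives the class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def select_queries(w: Sequence[float], b: float,
                   pool: Sequence[Sequence[float]], k: int) -> List[int]:
    """Indices of the k unlabeled vectors closest to the hyperplane.

    Assumed query strategy (margin-based uncertainty sampling): messages
    with the smallest |w.x + b| are the ones the classifier is least
    sure about, so they are sent for manual labeling first.
    """
    ranked = sorted(range(len(pool)),
                    key=lambda i: abs(decision_value(w, b, pool[i])))
    return ranked[:k]
```

In an active learning loop under these assumptions, the classifier would be retrained after each batch of newly labeled messages, and `precision_recall` would be evaluated on a held-out set to check that accuracy is preserved as fewer labels are used.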
Keywords/Search Tags: Spam, MTA, MDA, content-based, corpus, Bayes, active learning, SVM (support vector machine)