
The Research Of Spam E-mail Filtering Technology

Posted on: 2007-09-27
Degree: Master
Type: Thesis
Country: China
Candidate: M Tang
GTID: 2178360182487061
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, e-mail has become one of the most economical means of communication available, and even confidential correspondence is often sent by e-mail. However, spam has been a persistent problem on the Internet as e-mail has grown. Spam wastes users' time and energy as well as network bandwidth and storage space, so spam filtering is a subject of great practical significance.

Many techniques can be applied to spam filtering today, and content-based filtering is one of the mainstream approaches. At present, most content-based e-mail filters use the Naive Bayes (NB) algorithm, but NB's precision and recall are lower than those of the Support Vector Machine (SVM).

SVM is a relatively new machine learning method derived from statistical learning theory. It has been shown to have advantages over other learning machines: it follows the structural risk minimization principle, it yields a globally optimal solution that other machine learning methods cannot obtain, and it performs well when the training samples are nonlinear and high-dimensional.

However, a traditional SVM has several problems as an e-mail classifier. Training time is long, and accuracy and precision are not robust in large-scale operation. Legitimate e-mail is harder to collect than spam, so the two classes are imbalanced: the decision function is biased toward the smaller class, and a traditional SVM may assign legitimate e-mail entirely to the spam class. Some input points cannot be assigned exactly to one class, so a traditional SVM cannot separate them correctly.
Data points corrupted by noise are also less meaningful and reduce precision.

To overcome these problems, this thesis proposes two improved SVM methods for e-mail filtering: the Hierarchical Parallel Support Vector Machine (HPSVM) and the Fuzzy Support Vector Machine (FSVM). The work has two main contributions:

· The thesis presents a Hierarchical Parallel SVM algorithm for spam filtering. First, the whole classification problem is divided into several small subproblems that can be handled in parallel. Non-support-vector data are then filtered out hierarchically to obtain the final training set, which is used to train the final SVM. A cross-combining principle is introduced into the filtering to reduce training time while preserving classifier performance; it also reduces the cost caused by the imbalance between the sizes of the two classes. In addition, the thesis applies principal component analysis, which makes e-mail feature extraction more efficient. Experiments show that the HPSVM algorithm not only speeds up training but also achieves better precision and recall.

· The thesis presents a novel anti-spam algorithm based on a Fuzzy Support Vector Machine with misclassification costs. FSVM handles the case in which input points cannot be assigned exactly to one class. In a traditional SVM the penalty parameter C is constant, i.e., legitimate e-mail and spam are treated equally in filtering, yet losing a legitimate e-mail is far more serious. The thesis therefore adds misclassification costs to the fuzzy support vector machine so that all legitimate e-mail is passed. It also uses the density of the sample distribution as the fuzzy membership, which reduces the effect of noisy data points. Experiments show that this algorithm improves the precision and performance of the e-mail classifier.
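The hierarchical, parallel filtering of non-support-vector data described in the first contribution can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not define the cross-combining principle, so a plain stratified split stands in for it, a small linear SVM trained by subgradient descent stands in for the real solver, and PCA is omitted. All function names, the `slack` tolerance, and the synthetic 2-D data are assumptions made for illustration.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def decision(w, b, x):
    """Signed score of a linear classifier; positive means spam (+1)."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def train_linear_svm(data, lam=0.01, lr=0.05, epochs=100, seed=0):
    """Linear soft-margin SVM via subgradient descent on the hinge loss."""
    rng = random.Random(seed)
    w, b = [0.0] * len(data[0][0]), 0.0
    order = list(range(len(data)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = data[i]
            violated = y * decision(w, b, x) < 1.0
            for j in range(len(w)):
                w[j] -= lr * (lam * w[j] - (y * x[j] if violated else 0.0))
            if violated:
                b += lr * y
    return w, b

def keep_margin_points(subset, slack=0.2):
    """Train on one subset and keep only points on or inside the margin,
    i.e. the approximate support vectors of that subproblem."""
    if not subset:
        return []
    w, b = train_linear_svm(subset)
    return [(x, y) for x, y in subset if y * decision(w, b, x) <= 1.0 + slack]

def split_stratified(data, parts, seed=0):
    """Deal each class round-robin so every subset sees both classes
    (an assumed stand-in for the thesis's cross-combining step)."""
    rng = random.Random(seed)
    subsets = [[] for _ in range(parts)]
    for label in (-1, 1):
        members = [d for d in data if d[1] == label]
        rng.shuffle(members)
        for k, d in enumerate(members):
            subsets[k % parts].append(d)
    return subsets

def hierarchical_parallel_svm(data, parts=4):
    """Filter level by level, halving the number of subsets each time,
    then train the final SVM on the surviving points."""
    current = list(data)
    while parts > 1:
        subsets = split_stratified(current, parts)
        # Threads show the structure only; CPU-bound training would need
        # processes or separate machines for a real speed-up.
        with ThreadPoolExecutor(max_workers=parts) as pool:
            filtered = pool.map(keep_margin_points, subsets)
            survivors = [d for subset in filtered for d in subset]
        current = survivors or current  # never train on an empty set
        parts //= 2
    return train_linear_svm(current), current

def make_toy_data(n_per_class, seed=0):
    """Synthetic 2-D stand-in for e-mail feature vectors (not real data)."""
    rng = random.Random(seed)
    return [([rng.gauss(1.5 * label, 1.0), rng.gauss(1.5 * label, 1.0)], label)
            for label in (-1, 1) for _ in range(n_per_class)]
```

Calling `hierarchical_parallel_svm(make_toy_data(300))` returns the final `(w, b)` pair together with the reduced training set. The final SVM sees only the points that stayed near the margin at every level, which is where the training-time saving is expected to come from.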
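The second contribution replaces the single constant C with a penalty that depends on both the class and the sample. A minimal Python sketch of that idea follows, under stated assumptions: the abstract does not give the density formula, so membership is taken here as a normalized inverse of the mean distance to the k nearest same-class points; the per-sample penalty is c_i = s_i * C_class; and a simple weighted subgradient trainer replaces a real SVM solver. The names `density_membership`, `sample_penalties`, `train_weighted_svm`, and the cost values are hypothetical.

```python
import math
import random

SPAM, LEGIT = 1, -1

def density_membership(points, k=3, floor=0.05):
    """Map each point to a membership in [floor, 1] from its local density.

    Density is taken as 1 / (1 + mean distance to the k nearest points of
    the same class), so isolated (likely noisy) points get low membership.
    """
    densities = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k] or [0.0]  # a lone point counts as fully dense
        densities.append(1.0 / (1.0 + sum(nearest) / len(nearest)))
    top = max(densities)
    return [max(floor, d / top) for d in densities]

def sample_penalties(X, y, class_cost, k=3):
    """Per-sample penalty c_i = s_i * C_class instead of one constant C."""
    penalties = [0.0] * len(X)
    for label, cost in class_cost.items():
        idx = [i for i, yi in enumerate(y) if yi == label]
        if not idx:
            continue
        memberships = density_membership([X[i] for i in idx], k)
        for i, s in zip(idx, memberships):
            penalties[i] = cost * s
    return penalties

def train_weighted_svm(X, y, penalties, lam=0.01, lr=0.05, epochs=200, seed=0):
    """Linear SVM by subgradient descent on lam/2*||w||^2 + c_i * hinge_i."""
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            score = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            # Only margin violators push the boundary, scaled by their penalty.
            step = lr * penalties[i] if y[i] * score < 1.0 else 0.0
            for j in range(len(w)):
                w[j] += step * y[i] * X[i][j] - lr * lam * w[j]
            b += step * y[i]
    return w, b
```

With `class_cost={LEGIT: 5.0, SPAM: 1.0}`, an error on a legitimate message weighs five times as much as an error on spam, while a legitimate-labelled point sitting far from the other legitimate points has its penalty scaled down by its low membership. The 5:1 ratio is illustrative only and does not come from the thesis.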
Keywords/Search Tags:Spam, Principal Component Analysis, Support Vector Machine, Fuzzy Support Vector Machine, Misclassification