A Fast Method To Train Imbalanced Email Set Based On Clustering

Posted on: 2014-01-06
Degree: Master
Type: Thesis
Country: China
Candidate: C Z Yang
GTID: 2248330395495456
Subject: Information management projects
Abstract/Summary:
With the Internet flooded with spam, email filtering has become an active research topic. Filtering is usually treated as a text-mining classification problem that separates normal emails from spam: a model learned from a labeled training set is used to classify each incoming email. In practice, however, new spam is produced continuously, so the sample set is very large and must be updated frequently, which makes training resource-intensive. Moreover, because of user privacy concerns, normal emails are much harder to collect than spam, so the training set is biased toward spam, which hurts filtering accuracy. The thesis therefore proposes a clustering-based method for training on imbalanced email sets: the training set is first compressed and balanced by clustering, and a support vector machine is then trained on the reduced set and used to classify emails. Experiments show that the method reduces training time and improves accuracy.
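
The compress-and-balance step described above can be illustrated with a minimal sketch. The abstract does not say which clustering algorithm is used or how clusters are turned into training samples, so everything here is an assumption: plain k-means is applied to each class separately, each class is replaced by the same number of centroids, and the names `kmeans` and `compress_and_balance` are hypothetical. Emails are assumed to be already converted to numeric feature vectors; the reduced set would then be passed to any SVM implementation.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed choice); returns k centroids."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct sample positions.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def compress_and_balance(spam, ham, k):
    """Replace each class by k centroids so both classes end up the same size."""
    k = min(k, len(spam), len(ham))
    X = kmeans(spam, k) + kmeans(ham, k)
    y = [1] * k + [0] * k  # 1 = spam, 0 = normal email
    return X, y


# Toy imbalanced set: 90 spam vectors versus 10 normal ones (synthetic data).
rng = random.Random(42)
spam = [[rng.gauss(2.0, 1.0), rng.gauss(2.0, 1.0)] for _ in range(90)]
ham = [[rng.gauss(-2.0, 1.0), rng.gauss(-2.0, 1.0)] for _ in range(10)]

X, y = compress_and_balance(spam, ham, k=5)
print(len(X), y.count(1), y.count(0))
```

In this sketch the 100 original vectors shrink to 10 representatives with equal class counts, which is the sense in which clustering both compresses and balances the training set before the SVM is fitted.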
Keywords/Search Tags: email filtering, fast training, imbalanced data, clustering, training set shrinking