Research On Online Learning Based Spam Filtering

Posted on: 2013-09-05
Degree: Master
Type: Thesis
Country: China
Candidate: Y W Shen
Full Text: PDF
GTID: 2248330395486730
Subject: Computer application technology
Abstract/Summary:
Email brings great convenience to people's life and work, but the mass of spam greatly hinders its use. Spam occupies too many network resources and harms the interests of users. Some people with ulterior motives use it to spread false news. Spam filtering is therefore an active problem in current research.

This paper studies spam filtering based on machine learning methods. These methods, which offer high accuracy at low cost, have become the mainstream approach to spam filtering. The paper is divided into four main parts.

First, we study the framework and filtering model of spam filtering based on online learning. We implement three spam filters, which use Naive Bayes, Support Vector Machines, and Logistic Regression respectively, and compare their advantages and disadvantages in terms of CPU time and accuracy.

Second, we study feature engineering in spam filtering, including feature extraction and feature selection. For feature extraction, we introduce a word-based method and a byte-level n-gram method. For feature selection, Information Gain and Bayesian statistics methods are proposed to reduce computational cost and slightly improve filter performance.

Third, we suggest that spam filtering can be treated as an online ranking task, and present an online ranking logistic regression method for spam filtering.

Finally, we show that noisy data sets harm or even break state-of-the-art spam filters. Spam filters based on machine learning methods attain near-perfect performance when they are given accurate labeling feedback for training. In real-world settings, however, users may give incorrect feedback. We create noisy data sets and use them to analyze how filtering performance changes with the number of noisy emails.
Keywords/Search Tags: spam filtering, online learning, feature selection, ranking learning, noisy user feedback
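
To make the online-learning setting in the abstract more concrete, the following is a minimal sketch of a logistic regression filter that is updated one message at a time over byte-level n-gram features. It is not the thesis's implementation: the names `byte_ngrams`, `featurize`, and `OnlineLogisticRegression`, the 4-gram length, the CRC32 feature hashing, the table size, and the learning rate are all assumptions chosen for illustration.

```python
import math
import zlib


def byte_ngrams(message: bytes, n: int = 4):
    """Yield every overlapping byte-level n-gram of the raw message."""
    for i in range(len(message) - n + 1):
        yield message[i:i + n]


def featurize(message: bytes, n: int = 4, dim: int = 2 ** 18):
    """Map a message to a set of binary feature indices.

    Hashing n-grams into a fixed-size table is an assumption made here
    to keep the sketch short; the abstract does not say how features
    are indexed.
    """
    return {zlib.crc32(gram) % dim for gram in byte_ngrams(message, n)}


class OnlineLogisticRegression:
    """Logistic regression trained by one gradient step per message."""

    def __init__(self, dim: int = 2 ** 18, rate: float = 0.1):
        self.w = [0.0] * dim   # one weight per hashed feature
        self.rate = rate       # learning rate (illustrative value)

    def score(self, feats) -> float:
        """Return the estimated probability that the message is spam."""
        z = sum(self.w[i] for i in feats)
        z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, feats, label: int) -> None:
        """Apply one gradient step; label is 1 for spam, 0 for ham."""
        error = label - self.score(feats)
        for i in feats:
            self.w[i] += self.rate * error


# Online protocol: score each message first, then learn from the
# user's feedback label before the next message arrives.
if __name__ == "__main__":
    stream = [
        (b"cheap pills buy now cheap pills", 1),
        (b"meeting agenda for monday attached", 0),
    ] * 10
    spam_filter = OnlineLogisticRegression()
    for raw, label in stream:
        feats = featurize(raw)
        prediction = spam_filter.score(feats)
        spam_filter.update(feats, label)
```

In this sketch, the effect of noisy user feedback could be explored by flipping a fraction of the labels in `stream` before training and observing how the scores change.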