
Research On Multi-layered Content-Based SPAM Filtering System

Posted on: 2010-07-31  Degree: Master  Type: Thesis
Country: China  Candidate: X Xu  Full Text: PDF
GTID: 2178360275499966  Subject: Computer application technology
Abstract/Summary:
Electronic mail (e-mail) is becoming one of the fastest and most economical means of communication available. At the same time, the growing problem of junk mail (also referred to as "spam") has created a need for e-mail filtering. Today, anti-spam measures commonly include blacklists and whitelists, manual rules, and keyword-based content filtering. Content-based spam filtering applies automated text categorization and information filtering to the problem, so that an e-mail filtering system can learn directly from a user's own mail set. Text categorization algorithms such as Naïve Bayes, KNN, Decision Tree and Boosting can all be applied to spam filtering. However, the effectiveness of Naïve Bayes is limited and it is not well suited to instant feedback learning, while the other algorithms are more effective but more expensive to compute. To resolve this problem, we propose combining Naïve Bayes with Winnow, a fast linear classifier. Winnow is trained online and is mistake-driven, which also makes it suitable for feedback learning. Experiments on an e-mail corpus show effective results.

The contents of this thesis are as follows:
(1) We analyzed commonly used feature extraction methods and put forward a feature extraction method based on word probability. A word confidence level parameter controls the efficiency and precision of feature extraction, so that it fits the categorization algorithm.
(2) After investigating the Bayesian categorization method, we designed an MUA-level filtering algorithm. It controls filter sensitivity through the word confidence level parameter and a risk function.
(3) Exploiting the efficiency of Winnow feedback learning, we find a linear classification function for every user and use it to filter mail. It serves as a second filter for mail that the Bayesian spam filter cannot characterize strongly. At the same time, user behavior is monitored to determine whether a message was misclassified, and this serves as the basis for correcting the classification function, so as to meet the requirements of personalized filtering.
(4) We designed the basic framework of a multi-layered content-based filtering system, as a prototype simulation system for spam filtering.
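
To make the phrase "online and mistake-driven" concrete, the following is a minimal sketch of the textbook Winnow update (the Winnow2 variant with promotion and demotion) over binary word features. It is not the thesis's implementation: the class name `Winnow`, the parameters `alpha` and `threshold`, their default values, and the example words are all assumptions chosen for illustration. Weights change only when the classifier makes a mistake, which is what makes user feedback cheap to absorb.

```python
from typing import Dict, Set


class Winnow:
    """Textbook Winnow2 over binary (word present/absent) features.

    Illustrative sketch only; parameter values are assumptions.
    """

    def __init__(self, alpha: float = 2.0, threshold: float = 4.0) -> None:
        self.alpha = alpha            # multiplicative update factor (> 1)
        self.threshold = threshold    # decision threshold on the weighted sum
        self.weights: Dict[str, float] = {}  # unseen words default to 1.0

    def score(self, words: Set[str]) -> float:
        # Linear function: sum of the weights of the words present in the mail.
        return sum(self.weights.get(w, 1.0) for w in words)

    def predict(self, words: Set[str]) -> bool:
        # True means "spam".
        return self.score(words) >= self.threshold

    def update(self, words: Set[str], is_spam: bool) -> bool:
        """Process one labeled mail; return True if a mistake was corrected."""
        if self.predict(words) == is_spam:
            return False  # mistake-driven: correct predictions change nothing
        # Missed spam: promote active words. False alarm: demote them.
        factor = self.alpha if is_spam else 1.0 / self.alpha
        for w in words:
            self.weights[w] = self.weights.get(w, 1.0) * factor
        return True


if __name__ == "__main__":
    clf = Winnow()
    # Feedback arrives one message at a time, e.g. from user corrections.
    clf.update({"free", "money", "now"}, is_spam=True)
    clf.update({"meeting", "now"}, is_spam=False)
    print(clf.predict({"free", "money", "now"}))
```

Each update touches only the words in the current message, so the cost of a feedback step is proportional to the message length rather than to the vocabulary size.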
Keywords/Search Tags: Spam filtering, Feature extraction, Naïve Bayes, Winnow