Spam Filtering Technology Research

Posted on:2007-11-11

Degree:Master

Type:Thesis

Country:China

Candidate:Q Lin

Full Text:PDF

GTID:2208360212977612

Subject:Computer application technology

Abstract/Summary:

Electronic mail (e-mail) is becoming one of the fastest and most economical means of communication. At the same time, the growing problem of junk mail has created a need for e-mail filtering. Current anti-spam measures commonly include blacklists and whitelists, manual rules, and keyword-based content filtering.

Another approach is to filter spam using automated text categorization and information filtering. An e-mail filtering system can learn directly from a user's own mail. Text categorization algorithms such as Naive Bayes, kNN, decision trees, and boosting can be applied to spam filtering. Naive Bayes is the most popular filtering algorithm, but its effectiveness is limited. Other algorithms are more effective but computationally more complex.

Because Bayesian filters are widely used, spam senders have found ways to get their messages past the filter. One common way is to insert white keywords. We present an anti-filtering system that uses the white-keyword insertion method and run experiments with it to study the robustness of the Bayesian filter. The results show that the Bayesian filter has poor robustness.

To address this problem, we propose a pattern-discovery-based Bayesian filter. The pattern-discovery module uses the TEIRESIAS algorithm, which can quickly discover unknown patterns that appear two or more times in a large corpus. It builds on earlier pattern-discovery work on problems in computational biology. In 2004, IBM applied the algorithm to spam filtering and reported effective results. We present a filtering algorithm that combines TEIRESIAS with the Bayesian filter. Experiments show that the algorithm achieves a high identification rate without degrading robustness.

The contents of this thesis are as follows:
1) An overview of the state of the art in spam filtering.
2) An introduction to rule-based filtering algorithms.
3) An investigation of the anti-spam problem from the text categorization perspective, introducing feature selection approaches, classifiers, and e-mail corpora for this task.
4) An anti-filtering system that uses the white-keyword insertion method, with experiments on it to study the robustness of the Bayesian filter.
5) A filtering algorithm that combines TEIRESIAS with the Bayesian filter, with tests of the filter's performance and robustness.
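
The white-keyword weakness described in the abstract can be illustrated with a minimal sketch. This is not the thesis's anti-filtering system: it assumes a plain multinomial Naive Bayes filter with add-one smoothing, takes "white keywords" to mean words that are typical of legitimate mail, and uses a tiny made-up corpus. The names `train`, `spam_log_odds`, and `white_words` are hypothetical and chosen only for this illustration.

```python
import math
from collections import Counter

def train(docs):
    """Count words per class. docs is a list of (label, text) pairs."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for label, text in docs:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, doc_counts

def spam_log_odds(model, text):
    """Log-odds of spam vs. ham; a positive value means 'classified as spam'."""
    word_counts, doc_counts = model
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    score = math.log(doc_counts["spam"] / doc_counts["ham"])  # class prior
    for label, sign in (("spam", 1), ("ham", -1)):
        total = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothed word likelihood
                p = (word_counts[label][word] + 1) / (total + len(vocab))
                score += sign * math.log(p)
    return score

# Toy training data, invented purely for this illustration.
corpus = [
    ("spam", "cheap pills buy now"),
    ("spam", "buy cheap watches now"),
    ("spam", "win money now"),
    ("ham", "meeting schedule for project"),
    ("ham", "please review the project report"),
    ("ham", "lunch meeting tomorrow"),
]
model = train(corpus)

spam_msg = "buy cheap pills now"
# The attacker pads the same message with words common in legitimate mail.
white_words = ["meeting", "project", "report", "schedule"]
padded = spam_msg + " " + " ".join(white_words * 2)

print(spam_log_odds(model, spam_msg))  # positive: flagged as spam
print(spam_log_odds(model, padded))    # negative: slips through as legitimate
```

Because Naive Bayes treats every word as independent evidence, each inserted white keyword subtracts from the spam score no matter where it appears in the message, so enough padding outweighs the spam words. On real corpora the amount of padding needed would depend on the training data and the filter's threshold.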

Keywords/Search Tags:

spam filtering, Teiresias, anti-filtering algorithm

Related items

1	Research And Implementation Of A Three-Dimensional Hybrid Spam Filtering Method
2	Spam Filtering Method And System Realization
3	Research And Implementation Of Spam Filtering System Based On The Sender Abnormal Behavior Detection
4	Chinese Anti-spam Filtering System Development And Research
5	A Spam Hybrid Filtering Technology Research
6	SVM-Based Novel Method Of Online Spam Filtering
7	Algorithm Based On Bayesian Filtering, Anti-spam Technology And Its Implementation
8	The Design And Implementation Of Content-Based Anti-Spam Email System
9	Adaptive anti-spam e-mail filtering using Huffman coding and statistical learning
10	Research On Auto-learning Anti-spam Services With No-labeled