Research And Design Of Content-Based Spam Filter

Posted on: 2007-01-17
Degree: Master
Type: Thesis
Country: China
Candidate: S J Li
Full Text: PDF
GTID: 2178360182483259
Subject: Systems analysis and integration
Abstract/Summary:
Email has become a major means of communication thanks to the rapid spread of the Internet and to its speed, convenience and low cost. With it has come a flood of junk email, commonly called spam. Spam consumes limited storage, computing and network resources and wastes the time needed to handle it, seriously disrupting users' work, study and daily life. Managing spam effectively is now an urgent problem for the Internet and a difficulty faced worldwide.

Based on an in-depth investigation of a large number of recent spam samples, this thesis summarizes the deceptive techniques that spammers commonly use. Drawing on a large body of domestic and international anti-spam literature and data, it analyzes existing anti-spam techniques, in particular content-based spam filtering methods. For the Naive Bayes algorithm, which is widely used in spam filtering, an improved method is proposed that reduces the misclassification of legitimate email and performs better in spam categorization and filtering. The thesis also studies text replication check (duplicate detection) techniques for spam filtering and implements the Nilsimsa algorithm on that basis. Finally, a URL-based filtering method is proposed for spam that contains a large number of hyperlinks in HTML form. Experimental results support this method: it identifies spam that is difficult for content-based filters to catch, and so provides an effective supplement to content-based filtering.

Chapter 1 introduces the background of anti-spam research, the definition, history and composition of spam, and the organization of the thesis. Chapter 2 describes the principles and protocols of the email system. Chapter 3 analyzes and compares the three main anti-spam techniques, with emphasis on spam filtering, and summarizes the deceptive techniques commonly used by current spammers. Chapter 4 explains the Naive Bayes algorithm and its application to spam filtering; to address its shortcomings, an improved scheme, the risk minimization Bayes algorithm, is presented and the performance of the two is compared. This chapter also covers the implementation of the Nilsimsa algorithm for detecting similar emails and of the URL-based filter, with an analysis of the results. Chapter 5 summarizes the thesis and points out directions for future work.
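
The abstract does not state the formulas behind the risk minimization Bayes algorithm, so the following is only a minimal sketch of one common reading of the idea: a multinomial Naive Bayes classifier whose decision threshold is shifted so that blocking a legitimate email counts as more costly than letting a spam through. The function names, the add-one smoothing, and the `cost_ratio` parameter are assumptions made for illustration and are not taken from the thesis.

```python
import math
from collections import Counter

def train(messages):
    """Collect per-class token counts from (tokens, label) pairs.

    Labels are assumed to be "spam" or "ham"; both classes must appear.
    """
    counts = {"spam": Counter(), "ham": Counter()}
    docs = Counter()
    for tokens, label in messages:
        counts[label].update(tokens)
        docs[label] += 1
    vocab = set(counts["spam"]) | set(counts["ham"])
    return counts, docs, vocab

def log_posterior_ratio(tokens, model):
    """Return log(P(spam | tokens) / P(ham | tokens)) under Naive Bayes."""
    counts, docs, vocab = model
    totals = {c: sum(counts[c].values()) for c in counts}
    score = math.log(docs["spam"] / docs["ham"])  # log prior ratio
    for t in tokens:
        if t not in vocab:
            continue  # tokens never seen in training carry no evidence
        # Add-one (Laplace) smoothing, an assumption of this sketch.
        p_spam = (counts["spam"][t] + 1) / (totals["spam"] + len(vocab))
        p_ham = (counts["ham"][t] + 1) / (totals["ham"] + len(vocab))
        score += math.log(p_spam / p_ham)
    return score

def classify(tokens, model, cost_ratio=9.0):
    """Minimum-risk decision.

    cost_ratio (hypothetical parameter) is how many times worse it is to
    block a legitimate email than to miss a spam. Deciding "spam" has lower
    expected cost only when P(spam | x) > cost_ratio * P(ham | x).
    """
    threshold = math.log(cost_ratio)
    return "spam" if log_posterior_ratio(tokens, model) > threshold else "ham"
```

With `cost_ratio=1.0` this reduces to the ordinary Naive Bayes decision; larger values make the filter more reluctant to flag a message, trading some missed spam for fewer blocked legitimate emails.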
Keywords/Search Tags: spam, Bayes, risk minimization, replication check, URL filter