Spam Filtering System

Posted on: 2006-01-19
Degree: Master
Type: Thesis
Country: China
Candidate: Z Ma
Full Text: PDF
GTID: 2208360152966416
Subject: Computer Science and Technology
Abstract/Summary:
With the widespread use of email, spam, which carries commercial advertisements, malicious programs, or sensitive content, poses a growing threat to the security of computer systems and to people's daily lives. Fighting spam has become a significant and practical problem of international concern.

Email filtering is one of the key anti-spam technologies. Current filtering techniques fall into three categories: those based on message content, those based on the IP address, and those based on the header or envelope of the email. Each is effective to some degree. However, because the body of an email is the essential carrier of spam, judgments based only on the IP address, the header, or the envelope are prone to error.

This paper proposes a spam filtering system that runs independently in front of the mail server and uses a Bayes algorithm to filter spam. It treats the various characteristics of spam as "attributes" and combines these attributes into a feature vector that represents those characteristics. This avoids the shortcoming of strong rules that rely only on the IP address, the header, or the envelope.

To improve the system's capabilities, this paper studies the technologies underlying spam filtering, including Chinese word segmentation, the dictionary mechanism, and automated text classification. After analyzing existing Chinese word segmentation techniques, it combines the leftward-increasing maximum matching algorithm with the rightward-decreasing minimum matching algorithm to segment text and uses mutual information to resolve ambiguities, which improves the precision of Chinese word segmentation.
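
The idea of running two matching passes and settling disagreements with mutual information can be sketched as follows. This is a minimal illustration, not the thesis's implementation: it substitutes the standard forward and backward maximum matching algorithms for the two matching variants named above, and the dictionary, the character and pair counts, and the scoring function `cohesion_score` are all hypothetical.

```python
import math

# Toy dictionary and corpus statistics: hypothetical values for illustration only.
VOCAB = {"研究", "研究生", "生命", "起源"}
MAX_WORD_LEN = 3
TOTAL = 100  # assumed corpus size used to turn counts into probabilities
CHAR_COUNT = {"研": 8, "究": 8, "生": 10, "命": 7, "起": 6, "源": 5}
PAIR_COUNT = {("研", "究"): 8, ("究", "生"): 2, ("生", "命"): 6,
              ("命", "起"): 1, ("起", "源"): 5}


def forward_max_match(text):
    """Scan left to right, always taking the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in VOCAB:  # single characters always match
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text):
    """Scan right to left, always taking the longest dictionary word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(MAX_WORD_LEN, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in VOCAB:
                words.append(piece)
                j -= size
                break
    return words[::-1]


def pmi(a, b):
    """Pointwise mutual information of adjacent characters, add-one smoothed."""
    pair = PAIR_COUNT.get((a, b), 0) + 1
    return math.log(pair * TOTAL / ((CHAR_COUNT.get(a, 0) + 1) *
                                    (CHAR_COUNT.get(b, 0) + 1)))


def cohesion_score(words):
    """Reward strongly associated characters inside words, penalize them across cuts."""
    inside = sum(pmi(a, b) for w in words for a, b in zip(w, w[1:]))
    across = sum(pmi(left[-1], right[0]) for left, right in zip(words, words[1:]))
    return inside - across


def segment(text):
    """Run both passes; if they disagree, keep the higher-scoring segmentation."""
    fwd, bwd = forward_max_match(text), backward_max_match(text)
    return fwd if fwd == bwd else max((fwd, bwd), key=cohesion_score)
```

On the ambiguous string "研究生命起源", the two passes disagree on whether "生" belongs to "研究生" or to "生命". Under the assumed counts, "生" and "命" are more strongly associated than "究" and "生", so the score favors the backward result.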
To improve on the current dictionary mechanism, this paper proposes an improved PATRICIA-tree-based dictionary mechanism for Chinese word segmentation, which improves the speed and efficiency of segmentation while reducing space complexity and making the dictionary easier to build and maintain. After comparing various feature selection functions, this paper adopts expected cross entropy to select features, which lays the foundation for accurate text classification. This paper also analyzes two ways of improving the Bayes algorithm and points out that the two are essentially the same; it therefore adopts the improved Bayes algorithm to reduce the risk of false judgments.
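
The abstract does not say which two improvements to the Bayes algorithm were analyzed, so the following is only a sketch of one common way to reduce false judgments: a naive Bayes classifier with a minimum-risk decision rule, under which a message is marked as spam only when the spam-to-legitimate odds exceed a cost ratio. The class name `NaiveBayesSpamFilter`, the parameter `risk_ratio`, and the default value 9.0 are assumptions made for illustration.

```python
import math
from collections import Counter


class NaiveBayesSpamFilter:
    """Multinomial naive Bayes with Laplace smoothing and a risk threshold.

    risk_ratio is a hypothetical cost parameter: how many times worse it is
    to block a legitimate message than to let a spam message through.
    Both classes must be trained at least once before classifying.
    """

    def __init__(self, risk_ratio=9.0):
        self.risk_ratio = risk_ratio
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()
        self.vocab = set()

    def train(self, tokens, label):
        """Add one tokenized message with label 'spam' or 'ham'."""
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def _log_joint(self, tokens, label):
        """log P(label) + sum of log P(token | label) over known tokens."""
        total_docs = sum(self.doc_counts.values())
        log_p = math.log(self.doc_counts[label] / total_docs)
        denom = sum(self.word_counts[label].values()) + len(self.vocab)
        for token in tokens:
            if token in self.vocab:  # ignore tokens never seen in training
                log_p += math.log((self.word_counts[label][token] + 1) / denom)
        return log_p

    def log_odds(self, tokens):
        """Log of P(spam | tokens) / P(ham | tokens)."""
        return self._log_joint(tokens, "spam") - self._log_joint(tokens, "ham")

    def is_spam(self, tokens):
        """Flag as spam only when the odds exceed the cost ratio."""
        return self.log_odds(tokens) > math.log(self.risk_ratio)
```

With `risk_ratio=1.0` this reduces to the ordinary maximum-posterior decision. Raising the ratio makes the filter more conservative, so borderline messages are delivered rather than blocked.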
Keywords/Search Tags: email, spam, email filter