Analysis And Filtering Of Spam E-mail

Posted on: 2009-05-26
Degree: Master
Type: Thesis
Country: China
Candidate: X H Hu
Full Text: PDF
GTID: 2178360242974978
Subject: Computer application technology
Abstract/Summary:
With the widespread use of the Internet, electronic mail has become the most economical and widely used form of communication available, and it brings great convenience. At the same time, however, spam e-mail, a byproduct of that convenience, causes a great deal of trouble for users, network administrators and ISPs. The growing volume of spam has created a need for e-mail filtering, and anti-spam work has become a significant, practical topic of international concern. Existing anti-spam filtering systems mainly rely on their own rule sets, and content-based spam filtering technology is still immature, so filtering results are not ideal. To filter Chinese spam e-mail effectively, we carry out research on and development of anti-spam filtering. First, starting from the working principles of e-mail, this thesis studies spam filtering methods and draws on the theory of text categorization. We then put forward schemes that filter spam e-mail using two data mining algorithms: KNN (K-Nearest Neighbor) and RBF NN (Radial Basis Function Neural Network). E-mail samples are pre-processed before filtering. First, the maximum matching method is used to segment the e-mail text, and features are extracted from the resulting words. Next, to reduce the vector dimension, Mutual Information and Odds Ratio are used to select, from the large set of candidate features, the subset that is most useful for e-mail categorization. Finally, the feature weights are calculated, the e-mail texts are represented in the vector space model, and the e-mail sample database is constructed. Because KNN has a high computational cost, we propose two improved schemes. The proposed schemes are effective: compared with the existing KNN-based method, they lower the computational cost while retaining satisfactory classification accuracy for spam filtering. RBF NN is better suited to system identification because its outputs are linear in the weights, and applying it to spam e-mail filtering is a new attempt.
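The pre-processing and classification steps named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract gives no formulas or parameters, so everything below is an assumption. In particular, the sketch uses forward maximum matching with a hypothetical lexicon, estimates mutual information from document counts, uses raw term frequency as the feature weight, and classifies with plain cosine-similarity KNN. The two improved KNN schemes, Odds Ratio selection and the RBF network are not shown. All function names are illustrative.

```python
import math
from collections import Counter


def forward_max_match(text, lexicon, max_len=4):
    """Segment text greedily: at each position take the longest lexicon word.

    A single character is emitted when no lexicon word matches.
    """
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def mutual_information(term, label, docs):
    """Estimate MI(t, c) = log(P(t, c) / (P(t) P(c))) from document counts.

    docs is a list of (tokens, label) pairs. The term must occur in docs.
    """
    n = len(docs)
    n_t = sum(1 for toks, _ in docs if term in toks)
    n_c = sum(1 for _, lab in docs if lab == label)
    n_tc = sum(1 for toks, lab in docs if lab == label and term in toks)
    if n_tc == 0:
        return float("-inf")  # term never co-occurs with the class
    return math.log(n_tc * n / (n_t * n_c))


def select_features(docs, label, k):
    """Keep the k terms with the highest mutual information for the class."""
    vocab = sorted({t for toks, _ in docs for t in toks})
    vocab.sort(key=lambda t: mutual_information(t, label, docs), reverse=True)
    return vocab[:k]


def to_vector(tokens, features):
    """Represent a document as term-frequency weights over the features."""
    counts = Counter(tokens)
    return [counts[f] for f in features]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_classify(vec, training, k=3):
    """Majority vote among the k training vectors most similar to vec.

    training is a list of (vector, label) pairs.
    """
    ranked = sorted(training, key=lambda item: cosine(vec, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

In this sketch every query is compared against every training vector, which is the cost that motivates the improved schemes mentioned in the abstract. For Chinese text the lexicon would hold Chinese words and `max_len` would be the length of the longest lexicon entry.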
Keywords/Search Tags: Spam, E-mail Filter, Text Categorization, Data Mining, KNN, RBF