
Research On Content-Based Spam Filtering

Posted on: 2006-03-26 | Degree: Master | Type: Thesis
Country: China | Candidate: S Wang | Full Text: PDF
GTID: 2178360185995507 | Subject: Computer software and theory
Abstract/Summary:
Electronic mail (e-mail) is one of the great successes of the Internet and has become one of the fastest and most economical means of communication available. At the same time, the growing problem of junk mail (also referred to as "spam") has created a need for e-mail filtering. Many methods have been proposed to combat spam, and filtering based on automated text categorization and information filtering has become one of the most effective. We analyzed current content-based spam filtering technology and found many differences between the traditional text categorization problem and spam filtering. Based on this analysis, we develop several methods to improve the performance of spam filtering algorithms. The contents of this thesis are as follows:
(1) A summary of the state of content-based spam filtering. We examine the anti-spam problem from the text categorization perspective and introduce the feature selection approaches, classifiers, and e-mail corpora used in this task.
(2) Compared with the documents handled in typical text mining problems, e-mail carries many more kinds of information. After extensive investigation, we categorize all the features that can be used in content-based spam filtering, and we give a thorough investigation of the "attribute-feature", which has received little research attention.
(3) There are many differences between common data mining problems and spam filtering. Feature distributions vary considerably when the structures of two e-mail corpora are very different, and this diversity in feature distributions affects the performance of machine learning (ML) algorithms. We analyze this problem and propose a structure-based two-layer filtering model, which uses different machine learning filters to train on and classify mail of different structures. Experiments show that the ML algorithms' performance improves substantially with this model.
(4) Although content-based spam filtering methods can achieve strong performance, they have not been widely used. We analyze this situation and develop a model that moves the filter from the client to the mail server.
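The structure-based two-layer filtering model in item (3) can be illustrated with a minimal sketch. The abstract does not say how mail structure is detected or which learners are used, so everything concrete here is an assumption made for illustration: the `structure_of` rule and its three structure classes, the whitespace tokenizer, and the choice of a small naive Bayes classifier as the second-layer filter are all hypothetical, not the thesis's actual design.

```python
import math
from collections import Counter, defaultdict


def structure_of(mail):
    """Layer 1 (hypothetical rule): route a mail to a structure class."""
    if mail.get("attachments"):
        return "attachment"
    if "<html" in mail["body"].lower():
        return "html"
    return "plain"


def tokenize(mail):
    """Assumed tokenizer: lowercase the body and split on whitespace."""
    return mail["body"].lower().split()


class NaiveBayes:
    """Small multinomial naive Bayes with add-one smoothing (assumed learner)."""

    def __init__(self):
        self.word_counts = {"ham": Counter(), "spam": Counter()}
        self.doc_counts = Counter()
        self.vocab = set()

    def train(self, tokens, label):
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def classify(self, tokens):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        # "ham" is scored first, so a tie (e.g. no training data) yields "ham".
        for label in ("ham", "spam"):
            # Smoothed log prior, so an unseen class never produces log(0).
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            denom = sum(self.word_counts[label].values()) + len(self.vocab) + 1
            for tok in tokens:
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


class TwoLayerFilter:
    """Layer 2: one independently trained filter per structure class."""

    def __init__(self):
        self.filters = defaultdict(NaiveBayes)

    def train(self, mail, label):
        self.filters[structure_of(mail)].train(tokenize(mail), label)

    def classify(self, mail):
        return self.filters[structure_of(mail)].classify(tokenize(mail))


# Toy training data, invented for this example only.
f = TwoLayerFilter()
f.train({"body": "cheap pills buy now"}, "spam")
f.train({"body": "meeting agenda for monday"}, "ham")
f.train({"body": "<html> click here to win money </html>"}, "spam")
f.train({"body": "<html> monthly newsletter from your team </html>"}, "ham")

print(f.classify({"body": "buy cheap pills"}))
print(f.classify({"body": "<html> win money </html>"}))
```

Because each structure class has its own filter, word statistics learned from HTML mail never influence the decision for plain-text mail, which is the separation of feature distributions that item (3) describes.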
Keywords/Search Tags: spam filtering, text categorization, machine learning, data mining, information retrieval