Research On Content-Based Spam Filtering Technology

Posted on: 2010-11-02
Degree: Master
Type: Thesis
Country: China
Candidate: Y Z Lu
GTID: 2178360278959216
Subject: Computer application technology
Abstract/Summary:
With the rapid growth of computers and the Internet, e-mail has become part of daily life. It has also brought large volumes of spam, whose content includes reactionary propaganda, fraud, sales promotion, illicit sales, and so on. Spam not only disrupts normal communication but also harms society. Recent studies show that text is still the main form of spam, so content-based spam filtering remains the main research area in the anti-spam field.

Content-based spam filtering consists of four main parts: word segmentation, feature selection, text representation, and classification. A great deal of work has been done in these areas, and many results have been achieved. This thesis analyzes the principles of the four parts, studies feature selection algorithms, and improves the mutual information algorithm based on the characteristics of spam filtering.

The thesis briefly reviews the development, applications, and current status of content-based spam filtering and studies the principle of the algorithm used at each step. Its main focus is the application of mutual information to spam filtering, which is analyzed and improved with respect to frequency, divergence, and concentration. A word frequency factor is added to the conventional mutual information algorithm, and the ratio of mutual information values is used to measure how differently a feature is associated with each class. Finally, simulation tests on real e-mail are carried out in MATLAB, and a classification test using the naive Bayes algorithm is carried out in WEKA.

The test results show that, at the same dimension compression ratio, the improved mutual information algorithm achieves markedly higher precision and recall for spam, which means that it provides a better basis for the subsequent spam classification.
Keywords/Search Tags: Spam, Filtering, Feature Selection, Mutual Information, Bayes
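
For orientation, the conventional mutual information criterion that the abstract takes as its starting point can be sketched as follows. This is only a minimal illustration of the textbook document-frequency form, MI(t, c) = log(P(t, c) / (P(t) P(c))); it does not reproduce the improved algorithm, whose exact frequency, divergence, and concentration terms are not given in the abstract. The function names and the toy corpus are assumptions made for this example, and the thesis itself reports using MATLAB and WEKA, not Python.

```python
import math
from collections import Counter

def mutual_information(docs, labels, target="spam"):
    """Score each term by conventional pointwise MI with the target class.

    docs   -- list of token lists (assumed to be already segmented)
    labels -- list of class labels, one per document
    """
    n = len(docs)
    n_class = sum(1 for y in labels if y == target)
    df = Counter()        # number of documents containing the term
    df_class = Counter()  # number of target-class documents containing the term
    for tokens, y in zip(docs, labels):
        for t in set(tokens):  # count each term once per document
            df[t] += 1
            if y == target:
                df_class[t] += 1
    scores = {}
    for t, n_term in df.items():
        a = df_class[t]
        if a == 0:
            # The term never co-occurs with the class, so log(0) is -infinity.
            scores[t] = float("-inf")
        else:
            # MI(t, c) = log( P(t, c) / (P(t) * P(c)) ), estimated from counts.
            scores[t] = math.log(a * n / (n_term * n_class))
    return scores

def select_features(scores, k):
    """Keep the k highest-scoring terms (the dimension compression step)."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical toy corpus, used only to exercise the functions.
docs = [
    ["free", "money", "now"],
    ["free", "offer"],
    ["meeting", "now"],
    ["project", "meeting"],
]
labels = ["spam", "spam", "ham", "ham"]

scores = mutual_information(docs, labels)
top_terms = select_features(scores, 3)
```

In this sketch a term that appears in a single spam message ("money") scores as high as one that appears in every spam message ("free"). That insensitivity to frequency is a known weakness of the conventional criterion and is consistent with the abstract's motivation for adding a word frequency factor.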