
Research on Chinese Spam Filtering Technology Based on Content Mining

Posted on: 2011-07-09
Degree: Master
Type: Thesis
Country: China
Candidate: L P Xu
Full Text: PDF
GTID: 2178330332482014
Subject: E-commerce
Abstract/Summary:
With the rapid development of network and communication technology, e-mail (electronic mail) is becoming the most important means of communication among modern people. However, while people enjoy the convenience of e-mail, they also suffer the annoyance of spam. Because the flood of junk e-mail (spam) wastes network resources, damages users' personal interests, and undermines the security and stability of society, it has attracted attention and concern from both the general public and researchers. Spam filtering technology has become one of the focuses of current research.

First, after an in-depth survey of the domestic and international anti-spam literature and data, a systematic analysis of the background and current state of spam is given. Second, the closely related principles of e-mail and the SMTP protocol are introduced, and the security flaws of e-mail and the reasons for the proliferation of spam are analyzed. Last, on the basis of this research, content-based spam filtering technologies are analyzed in depth, including message extraction, Chinese word segmentation, feature value selection, and text representation.

Because spammers insert interfering information to evade filtering, plain-text extraction is performed as a preprocessing step and the dimension of the feature words is pre-reduced before word segmentation, which effectively reduces the feature dimension and greatly improves the efficiency of the algorithm. Because some common and rare words contribute very little to classification, the removal of stop words and sparse words is added to the word segmentation process, giving a new word segmentation procedure. In feature value selection with the mutual information (MI) algorithm, MI has shortcomings for spam filtering with respect to frequency, concentration, and negative correlation. We therefore improve the traditional mutual information algorithm by adding a factor for the frequency of the feature words in the document, and we propose an "absolute difference" d(ti) to measure each feature word's contribution to classification. After all d(ti) values are sorted, the feature words with the K highest d(ti) values are selected as the feature subset.

To verify the effect of the improved Chinese word segmentation, segmentation experiments with the improved and unimproved algorithms are carried out using Wuhan University's ROSTContentMining software. The mutual information algorithm is then used in MATLAB to select feature words from a real e-mail set. The experimental results show that the d(ti) values of the improved mutual information algorithm are distributed over different ranges rather than concentrated near certain values, so these values can play a greater role in distinguishing the categories. Finally, for e-mail classification, the feature words with the K highest d(ti) values are selected as the feature subset, and classifier training and classification are carried out on the Weka platform using a Bayesian classifier with 10-fold cross-validation. Evaluated against the spam filtering system's evaluation criteria, the results verify the improved classification performance of the algorithm.
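
The abstract does not give the formula for d(ti), so the following is only a minimal Python sketch of one plausible reading, not the thesis's actual method: it computes standard pointwise mutual information between each term and each class from document counts, takes d(t) as the absolute difference between the spam and ham values, and keeps the K highest-scoring terms. The function names, the smoothing constant, and the toy messages are all hypothetical, and the document-frequency factor mentioned in the abstract is not reproduced because its form is not stated.

```python
import math
from collections import Counter


def class_mi(docs, labels, cls, smooth=0.5):
    """Pointwise MI between each term and class `cls`, from document counts.

    `smooth` is an assumed additive constant that keeps the logarithm finite
    when a term never occurs in the class.
    """
    n = len(docs)
    n_cls = sum(1 for y in labels if y == cls)
    df = Counter()      # number of documents containing the term
    df_cls = Counter()  # number of documents of class `cls` containing it
    for tokens, y in zip(docs, labels):
        for t in set(tokens):
            df[t] += 1
            if y == cls:
                df_cls[t] += 1
    # MI(t, c) = log( P(t, c) / (P(t) * P(c)) ), estimated from counts
    return {
        t: math.log(((df_cls[t] + smooth) * n) / ((df[t] + smooth) * n_cls))
        for t in df
    }


def top_k_by_abs_difference(docs, labels, k, spam="spam", ham="ham"):
    """Rank terms by d(t) = |MI(t, spam) - MI(t, ham)| (assumed definition)."""
    mi_spam = class_mi(docs, labels, spam)
    mi_ham = class_mi(docs, labels, ham)
    d = {t: abs(mi_spam[t] - mi_ham[t]) for t in mi_spam}
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(d, key=lambda t: (-d[t], t))
    return ranked[:k], d


# Toy, already-segmented messages (hypothetical data).
docs = [
    ["free", "prize", "click"],
    ["free", "offer", "click"],
    ["meeting", "report", "free"],
    ["meeting", "schedule", "report"],
]
labels = ["spam", "spam", "ham", "ham"]

selected, scores = top_k_by_abs_difference(docs, labels, k=3)
print(selected)
```

In this toy set, a word such as "free" that appears in both classes receives a lower score than words confined to one class, which is the behavior a feature selector for classification is meant to have. The selected terms would then serve as the feature subset passed to the classifier.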
Keywords/Search Tags:junk email, Bayesian, feature value selection, mutual information, Weka