Research On Chinese Spam Filtering Technology Based On Content Mining

Posted on:2011-07-09

Degree:Master

Type:Thesis

Country:China

Candidate:L P Xu

Full Text:PDF

GTID:2178330332482014

Subject:E-commerce

Abstract/Summary:

With the rapid development of network and communication technology, e-mail is becoming the most important means of communication in modern life. However, while people enjoy the convenience of e-mail, they also suffer the annoyance of spam. Because the flood of junk e-mail (spam) wastes network resources, damages users' personal interests, and undermines the security and stability of society, it has attracted attention and concern from both the general public and researchers. Spam filtering technology has become one of the focuses of current research.

First, after an in-depth survey of the domestic and international anti-spam literature and data, this thesis gives a systematic analysis of the background and current state of spam. Second, it introduces the closely related principles of the SMTP protocol and e-mail, and analyzes the security flaws of e-mail and the reasons spam has proliferated. Finally, on the basis of this research, it gives an in-depth analysis of content-based spam filtering technologies, including message extraction, Chinese word segmentation, feature value selection, and text representation.

Because spammers insert interfering information to evade filtering, plain-text extraction is performed as a preprocessing step and the feature words are pre-reduced before word segmentation, which effectively reduces the dimension of the feature words and greatly improves the efficiency of the algorithm. Because some common words and rare words contribute very little to classification, the removal of stop words and sparse words is added to the word segmentation process, giving a new word segmentation procedure. In feature value selection with the mutual information (MI) algorithm, MI has shortcomings for spam filtering with respect to frequency, concentration, and negative correlation. We therefore improve the traditional mutual information algorithm by adding a factor for the feature word's frequency in the document, and finally propose an "absolute difference" d(ti) to measure a feature word's contribution to classification. After all d(ti) values are sorted, we select the K features with the highest d(ti) values as the feature subset.

To verify the effect of the improved Chinese word segmentation, we run word segmentation experiments with both the improved and the unimproved algorithms using Wuhan University's ROSTContentMining software. We then use the mutual information algorithm in MATLAB to select feature words from a real e-mail set. Experimental results show that the d(ti) values of the improved mutual information algorithm are distributed over different ranges rather than concentrated around certain values, and that these differing values can play a greater role in distinguishing categories. Finally, for e-mail classification, we select the K features with the highest d(ti) values as the feature subset and carry out classifier training and classification on the Weka platform, using Bayesian classification with the 10-fold cross-validation test option. Evaluation against spam filtering system criteria verifies the improved classification performance of the algorithm.
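
The feature selection steps described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the formula for d(ti) or the form of the frequency factor, so the sketch assumes one plausible reading, in which d(t) is the absolute gap between a term's classical mutual information with the spam class and with the ham class, and it omits the frequency factor entirely. The function names, the "spam"/"ham" labels, the add-one smoothing, and the min_doc_freq threshold are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def filter_tokens(docs, stop_words, min_doc_freq=2):
    """Drop stop words and sparse words from already-segmented documents.

    A word counts as sparse here if it occurs in fewer than
    min_doc_freq documents (an assumed threshold).
    """
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    return [
        [t for t in tokens if t not in stop_words and doc_freq[t] >= min_doc_freq]
        for tokens in docs
    ]


def mutual_information(docs, labels, term, category, smoothing=1.0):
    """Classical MI(t, c) = log(P(t | c) / P(t)), estimated from
    document counts with additive smoothing (an assumption)."""
    n = len(docs)
    n_c = sum(1 for y in labels if y == category)
    n_t = sum(1 for d in docs if term in d)
    n_tc = sum(1 for d, y in zip(docs, labels) if y == category and term in d)
    p_t_given_c = (n_tc + smoothing) / (n_c + 2 * smoothing)
    p_t = (n_t + smoothing) / (n + 2 * smoothing)
    return math.log(p_t_given_c / p_t)


def absolute_difference(docs, labels, term):
    """Assumed form of d(t): |MI(t, spam) - MI(t, ham)|.

    The thesis's actual d(ti), including its frequency factor,
    is not stated in the abstract and is not reproduced here.
    """
    return abs(
        mutual_information(docs, labels, term, "spam")
        - mutual_information(docs, labels, term, "ham")
    )


def select_top_k(docs, labels, k):
    """Sort the vocabulary by d(t), highest first, and keep the top k.

    Ties are broken alphabetically so the result is deterministic.
    """
    vocab = {t for d in docs for t in d}
    ranked = sorted(vocab, key=lambda t: (-absolute_difference(docs, labels, t), t))
    return ranked[:k]
```

Under this reading, a term that appears in most spam messages and almost no legitimate ones receives a large d(t), while a term spread evenly across both classes scores near zero, which matches the abstract's goal of keeping the K words that contribute most to classification. The selected words would then serve as the attribute set passed to a Bayesian classifier.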

Keywords/Search Tags:

junk email, Bayesian, feature value selection, mutual information, Weka

Related items

1	Two Feature Selection Algorithms Based On Mutual Information And Bayesian Optimization
2	Research On Content-Based Chinese Junk Short Messages Classification Technology
3	Secure Email Server System
4	The Research Of Bayesian Classifier And Its Applications
5	Research On Feature Selection Algorithm Based On Mutual Information
6	Algorithm Based On Bayesian Filtering, Anti-spam Technology And Its Implementation
7	Research On Dynamic Feature Selection Algorithm Based On Mutual Information
8	Research On Mutual Information Based Feature Selection Method For High Dimensional Small Sample Data
9	Improvement On Mutual Information In Feature Selection Based On Composite Ratio
10	Research On Mutual Information Based Feature Selection Algorithm