Research On Feature Selection Algorithm Of Spam Filtering

Posted on:2017-06-22

Degree:Master

Type:Thesis

Country:China

Candidate:M Li

GTID:2348330512455964

Subject:Computer application technology

Abstract/Summary:

In recent years, with the rapid development of the Internet, e-mail has become an important means of communication in daily life. The growth of e-mail, however, has been accompanied by spam, which not only inconveniences large numbers of users but is also increasingly used by criminals to spread illegal information. Against this background, this thesis studies content-based anti-spam technology, analyzes the shortcomings of two traditional feature selection algorithms, proposes two new feature selection algorithms, and verifies their effectiveness experimentally.

The thesis analyzes the shortcomings of the traditional information gain and mutual information algorithms and proposes an improvement for each, as follows:

1. The traditional information gain feature selection algorithm measures the degree of association between a feature item and a category, but it does not analyze how concentrated or dispersed a feature is within or between classes. Building on traditional information gain, the thesis introduces the degree of dispersion within a class and the degree of concentration between classes, and uses them to improve the traditional feature selection algorithm. In experiments with Bayes and support vector machine classifiers on five datasets, a comparison of recall, precision, AUC, and F1 shows that the proposed scheme outperforms the information gain, chi-square statistic, and mutual information algorithms.

2. The traditional mutual information algorithm measures the correlation between feature items and categories, but it considers only positive correlation and ignores negative correlation, and it does not screen out rare features. Taking these factors into account, the thesis proposes an improved algorithm that, on the one hand, screens out rare features and, on the other hand, considers both positive and negative correlation between feature items and categories. As with the first improved algorithm, different classifiers are applied to different datasets and compared on the four evaluation criteria; the experimental results show that the proposed scheme outperforms the information gain, chi-square statistic, and mutual information algorithms.

Although the two proposed algorithms are shown to perform better than the traditional feature selection algorithms, they are unstable on some datasets, which will be the next focus of this research. In addition, the spam test samples used in this study are all plain text, whereas criminals have begun sending large amounts of image spam to evade filtering mechanisms; how to effectively identify and block image spam will be another focus of future work.
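
For orientation, the two traditional scores that the abstract takes as its baselines can be sketched as below. This is only a minimal illustration of the textbook definitions for a two-class (spam/ham) problem with binary term presence; it is not the improved algorithms of the thesis, whose formulas are not given in the abstract. The function names and the document counts used in the example are assumptions made for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_gain(n11, n10, n01, n00):
    """Information gain of a binary term-presence feature for two classes.

    n11: spam documents containing the term
    n10: ham documents containing the term
    n01: spam documents without the term
    n00: ham documents without the term
    """
    n = n11 + n10 + n01 + n00
    h_class = entropy([(n11 + n01) / n, (n10 + n00) / n])
    h_cond = 0.0
    # Weighted class entropy given term present, then given term absent.
    for spam, ham in ((n11, n10), (n01, n00)):
        total = spam + ham
        if total:
            h_cond += (total / n) * entropy([spam / total, ham / total])
    return h_class - h_cond


def pointwise_mi(n_tc, n_t, n_c, n):
    """log2(P(t, c) / (P(t) * P(c))) estimated from document counts.

    n_tc: documents of class c containing term t
    n_t:  documents containing term t
    n_c:  documents of class c
    n:    all documents
    Returns -inf when the term never occurs in the class.
    """
    if n_tc == 0:
        return float("-inf")
    return math.log2(n_tc * n / (n_t * n_c))


# Hypothetical corpus of 100 documents, 50 spam and 50 ham.
ig_perfect = information_gain(50, 0, 0, 50)     # term appears in all spam only
ig_useless = information_gain(25, 25, 25, 25)   # term independent of class
mi_frequent = pointwise_mi(50, 50, 50, 100)     # frequent term, spam only
mi_rare = pointwise_mi(1, 1, 50, 100)           # term seen once, in spam
mi_negative = pointwise_mi(5, 50, 50, 100)      # term mostly found in ham
```

With these hypothetical counts, the term seen in a single spam message receives the same mutual information score as the term present in every spam message, and the term that occurs mostly in ham receives a negative score. These are the two behaviours of the traditional mutual information score that the abstract singles out: rare features are not screened out, and negative correlation is not put to use.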

Keywords/Search Tags:

Text Classification, Spam, Feature Selection, Information Gain, Mutual Information

Related items

1	The Research And Implementation Of Chinese Text Classification Based On Feature Selection And LDA
2	Research And Improvement Of Feature Selection Algorithm In Text Classification
3	Improvement On Mutual Information In Feature Selection Based On Composite Ratio
4	Study Of Mutual Information Feature Selection In Chinese Text Classification
5	Analysis And Study On Feature Selection Method In Chinese Text Categorization
6	The Research Of Feature Selection Method In Text Classification Based On Triple-Play
7	Research Of Chinese Text Classification Algorithms Based On VSM
8	Research Of Feature Selection For Text Classification
9	Research On The Algorithm Of Feature Selection Based On Mutual Information For Text Categorization
10	Research On Text Feature Selection And Classification Algorithms