
Research On Feature Selection Algorithm Of Spam Filtering

Posted on: 2017-06-22
Degree: Master
Type: Thesis
Country: China
Candidate: M Li
Full Text: PDF
GTID: 2348330512455964
Subject: Computer application technology
Abstract/Summary:
In recent years, with the rapid development of the Internet, e-mail has become an important means of communication in people's daily lives. As e-mail has spread, however, spam has appeared as well. Spam not only causes trouble for a large number of users; criminals have also begun to use it to spread illegal information. Against this background, this paper studies content-based anti-spam technology, analyzes the shortcomings of two traditional feature selection algorithms, proposes two new feature selection algorithms, and conducts verification experiments to demonstrate the effectiveness of the proposed algorithms.

This paper analyzes the shortcomings of the traditional information gain and mutual information algorithms and proposes an improvement for each, as follows:

1. The traditional information gain feature selection algorithm measures the degree of association between a feature item and a category, but it does not analyze how concentrated a feature is within a class or how dispersed it is between classes. Building on traditional information gain, this paper therefore introduces the concepts of intra-class classification and concentration in class and uses them to improve the traditional feature selection algorithm. In the experiments, Bayes and support vector machine classifiers were applied to five datasets. A comparison of recall, precision, AUC, and F1 shows that the proposed improvement performs better than the information gain, chi-square statistic, and mutual information algorithms.

2. The traditional mutual information algorithm measures the correlation between feature items and categories, but it considers only positive correlation and ignores negative correlation between them; it also does not screen out rare features.
Taking these factors into account, we propose an improved algorithm that, on the one hand, screens out rare features and, on the other hand, considers both the positive and the negative correlation between feature items and categories. As with the first improved algorithm, different classifiers were applied to different datasets and compared on the four evaluation criteria. The experimental results show that the proposed improvement performs better than the information gain, chi-square statistic, and mutual information algorithms.

Although the two algorithms proposed in this paper are shown to perform better than the traditional feature selection algorithms, they were unstable on some of the datasets; this will be the next focus of this research. In addition, the spam test samples in this study are all plain-text data, whereas criminals have now begun to send large amounts of image spam to evade spam filtering mechanisms. How to effectively identify and block image spam will be another focus of future work.
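
For orientation, the two traditional baselines named in the abstract can be sketched in a few lines of Python. This is only an illustration of the textbook definitions of information gain and (pointwise) mutual information computed from document counts; it is not the improved algorithms of this paper, whose formulas the abstract does not give. The function names, the `counts` layout, and the toy numbers are all assumptions made for this example.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(counts):
    """Textbook information gain of a term.

    counts (hypothetical layout): class label -> (docs containing the term,
    docs not containing the term).
    """
    n = sum(a + b for a, b in counts.values())
    n_t = sum(a for a, _ in counts.values())      # docs containing the term
    n_not_t = n - n_t                             # docs without the term
    h_class = entropy([(a + b) / n for a, b in counts.values()])
    h_cond = 0.0
    if n_t:
        h_cond += n_t / n * entropy([a / n_t for a, _ in counts.values()])
    if n_not_t:
        h_cond += n_not_t / n * entropy([b / n_not_t for _, b in counts.values()])
    return h_class - h_cond

def pointwise_mi(counts, label):
    """Textbook mutual information log2(P(t, c) / (P(t) * P(c))) for one class."""
    n = sum(a + b for a, b in counts.values())
    n_t = sum(a for a, _ in counts.values())
    n_c = sum(counts[label])
    a = counts[label][0]
    if a == 0:
        return float("-inf")                      # term never occurs in this class
    return math.log2(a * n / (n_t * n_c))

# Toy corpora of 100 documents each (made-up numbers, for illustration only).
perfect = {"spam": (50, 0), "ham": (0, 50)}       # term occurs in every spam mail
useless = {"spam": (25, 25), "ham": (25, 25)}     # term is independent of class
rare = {"spam": (1, 49), "ham": (0, 50)}          # term occurs in a single mail
skewed = {"spam": (10, 40), "ham": (40, 10)}      # term mostly indicates ham
```

With these toy counts, `pointwise_mi(rare, "spam")` equals `pointwise_mi(perfect, "spam")` even though the rare term appears in only one message, and `pointwise_mi(skewed, "spam")` is negative. These are the two behaviours of traditional mutual information that the abstract singles out: a bias toward rare features and the need to account for negative correlation.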
Keywords/Search Tags: Text Classification, Spam, Feature Selection, Information Gain, Mutual Information