Feature selection strategies for spam e-mail filtering

Posted on: 2007-04-06
Degree: M.A.Sc
Type: Thesis
University: Concordia University (Canada)
Candidate: Wang, Ren
Full Text: PDF
GTID: 2448390005476878
Subject: Engineering
Abstract/Summary:
The spam e-mail (also known as junk e-mail) problem is rapidly becoming unmanageable. According to a recent European Union study, junk e-mail costs all of us about 9.4 billion (US) dollars per year, and many major ISPs say that spam adds about 20% to the cost of their service.

Feature selection is an important research problem in text categorization applications, including spam e-mail filtering. In designing spam filters, we often represent e-mail with the vector space model (VSM), in which every e-mail is treated as a vector of word terms. Since e-mail contains many different terms, and not all classifiers can handle such a high dimension, only the most discriminative terms should be retained. In addition, some features may not be influential and may carry redundant information that can confuse the classifier. Thus, feature selection, and hence dimensionality reduction, is a crucial step in getting the best out of the constructed features.

Many feature selection strategies (FSS) can be applied to produce the desired feature set. In this thesis, we investigate the use of several classifier-dependent feature selection strategies. We cast the feature selection problem as a 0-1 optimization problem and compare different optimization techniques. These include several local search algorithms such as Hill Climbing, Simulated Annealing, Threshold Accepting and Tabu Search. We also examine algorithms inspired by biological systems and artificial life, such as Genetic Algorithms, Particle Swarm Optimization, Ant Colony Optimization and Artificial Immune Systems. The performance of all of the above algorithms is compared with traditional dimensionality reduction techniques such as Principal Component Analysis, Linear Discriminant Analysis and Singular Value Decomposition.

Our experimental results show that all of these techniques can be used not only to reduce the dimensionality of the e-mail VSM, but also to improve the performance of the spam filter.
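To make the "feature selection as 0-1 optimization" idea concrete, here is a minimal Python sketch. It is not the thesis's implementation. The toy corpus, the binary term-presence VSM, the 1-nearest-neighbour classifier, the leave-one-out accuracy objective and all function names (`build_vsm`, `loo_accuracy`, `hill_climb`) are assumptions chosen for illustration. Each candidate solution is a 0/1 mask over the vocabulary, and Hill Climbing flips one bit at a time, keeping the flip only if the classifier-dependent score improves.

```python
import random

# Hypothetical toy corpus for illustration only: (text, label), 1 = spam, 0 = legitimate.
EMAILS = [
    ("cheap pills buy now", 1),
    ("win money now free offer", 1),
    ("free pills offer click now", 1),
    ("buy cheap watches free shipping", 1),
    ("meeting agenda for monday", 0),
    ("please review the project report", 0),
    ("lunch on monday with the team", 0),
    ("project meeting notes and report", 0),
]


def build_vsm(emails):
    """Represent each e-mail as a binary term-presence vector over the vocabulary."""
    token_sets = [set(text.split()) for text, _ in emails]
    vocab = sorted(set().union(*token_sets))
    vectors = [[1 if term in tokens else 0 for term in vocab] for tokens in token_sets]
    labels = [label for _, label in emails]
    return vocab, vectors, labels


def loo_accuracy(vectors, labels, mask):
    """Leave-one-out accuracy of a 1-NN classifier that only sees terms with mask == 1."""
    if not any(mask):
        return 0.0  # an empty feature set cannot classify anything
    correct = 0
    for i, x in enumerate(vectors):
        best_j, best_d = None, None
        for j, y in enumerate(vectors):
            if i == j:
                continue
            # Hamming distance restricted to the selected terms
            d = sum(1 for m, a, b in zip(mask, x, y) if m and a != b)
            if best_d is None or d < best_d:
                best_j, best_d = j, d
        correct += labels[best_j] == labels[i]
    return correct / len(vectors)


def hill_climb(vectors, labels, n_iters=200, seed=0):
    """Search over 0/1 feature masks, accepting a single-bit flip only if it improves the score."""
    rng = random.Random(seed)
    n_terms = len(vectors[0])
    mask = [rng.randint(0, 1) for _ in range(n_terms)]
    score = loo_accuracy(vectors, labels, mask)
    history = [score]
    for _ in range(n_iters):
        k = rng.randrange(n_terms)
        mask[k] ^= 1  # flip one bit: add or drop term k
        new_score = loo_accuracy(vectors, labels, mask)
        if new_score > score:
            score = new_score  # keep the improving move
        else:
            mask[k] ^= 1  # otherwise undo the flip
        history.append(score)
    return mask, score, history


if __name__ == "__main__":
    vocab, vectors, labels = build_vsm(EMAILS)
    mask, score, _ = hill_climb(vectors, labels)
    selected = [term for term, m in zip(vocab, mask) if m]
    print(f"kept {len(selected)} of {len(vocab)} terms, leave-one-out accuracy {score:.2f}")
    print(selected)
```

Because the score comes from running a classifier on the masked vectors, this is a classifier-dependent (wrapper-style) strategy. The other local search methods named in the abstract would differ mainly in the acceptance rule. Simulated Annealing and Threshold Accepting, for instance, sometimes accept a worsening flip in order to escape local optima.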
Keywords/Search Tags:E-mail, Spam, Feature selection, Problem