Spam Filtering Based On Partial Least Squares

Posted on:2009-07-11

Degree:Master

Type:Thesis

Country:China

Candidate:P M Wang

GTID:2178360272480745

Subject:Computer system architecture

Abstract/Summary:

The problem of unsolicited bulk e-mail, or spam, grows worse every year, which has increased the need for automatic spam filters. Recent spam filtering research has focused on machine learning for the automatic creation of personalized spam filters; in other words, spam filtering is treated as a branch of text classification. The resulting filter has the advantage of being optimized for the e-mail distribution of the individual user.

Many machine learning algorithms have been applied to generate spam filters. However, because each message contains only a small fraction of the vocabulary, mail datasets exhibit high word dimensionality and severe data sparseness. Moreover, since different messages contain many synonyms or similar content, the severe multicollinearity among the words within a message must be taken into account. To deal with these problems, this thesis presents a new feature extraction method based on Partial Least Squares (PLS).

The idea of the method is first to analyze the relationship between a message's original features and its class label. To maximize the covariance between the two, the original features are repeatedly combined linearly to extract a much smaller number of new components, so that the subspace spanned by these components addresses the problems stated above. Finally, cross-validation is introduced to determine the dimension of the extracted subspace.

Experiments on the CEAS 2006 benchmark datasets (Enron-Spam datasets), evaluated according to the TREC spam track, show promising results, and the new method performs better than feature selection by χ² statistics.

The main contributions of this thesis are:
(1) A PLS feature extraction model for spam filtering, which effectively addresses the problems common in mail datasets: high word dimensionality, severe data sparseness, and multicollinearity.
(2) To improve filter efficiency, a cross-validation algorithm is introduced to determine the dimension of the extracted subspace.
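
The extraction step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes the single-response PLS1 variant fitted with NIPALS, labels coded as +1 (spam) and -1 (ham), and a k-fold prediction-error criterion for choosing the number of components. The function names (`pls1_fit`, `pls1_transform`, `pls1_predict`, `choose_components`) and the toy term-count matrix are hypothetical and serve only as illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pls1_fit(X, y, n_components):
    """Fit PLS1 with NIPALS. X: list of term-count rows, y: list of +1/-1 labels."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(m)] for row in X]  # centered features
    f = [v - y_mean for v in y]                                # centered labels
    W, P, Q = [], [], []
    for _ in range(n_components):
        # Weight vector: the feature-space direction with maximal covariance with the labels.
        w = [dot([E[i][j] for i in range(n)], f) for j in range(m)]
        norm = math.sqrt(dot(w, w))
        if norm < 1e-12:
            break  # no covariance left to extract
        w = [v / norm for v in w]
        t = [dot(row, w) for row in E]  # component scores: one linear combination per message
        tt = dot(t, t)
        p = [dot([E[i][j] for i in range(n)], t) / tt for j in range(m)]  # feature loadings
        q = dot(f, t) / tt                                                # label loading
        # Deflate features and labels so the next component captures new information.
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        W.append(w)
        P.append(p)
        Q.append(q)
    return {"x_mean": x_mean, "y_mean": y_mean, "W": W, "P": P, "Q": Q}

def pls1_transform(model, row):
    """Project one message onto the extracted components (the reduced feature vector)."""
    e = [v - mu for v, mu in zip(row, model["x_mean"])]
    scores = []
    for w, p in zip(model["W"], model["P"]):
        t = dot(e, w)
        scores.append(t)
        e = [ev - t * pv for ev, pv in zip(e, p)]
    return scores

def pls1_predict(model, row):
    """Predicted label value; its sign can serve as a spam/ham decision."""
    scores = pls1_transform(model, row)
    return model["y_mean"] + sum(q * t for q, t in zip(model["Q"], scores))

def choose_components(X, y, max_components, n_folds=3):
    """Pick the component count with the lowest cross-validated squared error (assumed criterion)."""
    best_k, best_press = 1, float("inf")
    for k in range(1, max_components + 1):
        press = 0.0
        for fold in range(n_folds):
            train = [i for i in range(len(X)) if i % n_folds != fold]
            held_out = [i for i in range(len(X)) if i % n_folds == fold]
            model = pls1_fit([X[i] for i in train], [y[i] for i in train], k)
            press += sum((y[i] - pls1_predict(model, X[i])) ** 2 for i in held_out)
        if press < best_press:
            best_k, best_press = k, press
    return best_k

# Hypothetical toy data: counts of four words in six messages (first three are spam).
X = [[3, 2, 0, 0], [2, 3, 0, 1], [4, 1, 1, 0],
     [0, 0, 3, 2], [0, 1, 2, 3], [1, 0, 4, 2]]
y = [1, 1, 1, -1, -1, -1]

k = choose_components(X, y, max_components=2)
model = pls1_fit(X, y, k)
print("components kept:", k)
print("reduced features of first message:", pls1_transform(model, X[0]))
```

In this sketch each component is a linear combination of all original word features, which is what distinguishes feature extraction from a feature selection approach such as χ² ranking, where a subset of the original words is kept unchanged. On a realistic vocabulary a matrix library would replace the list-based arithmetic.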

Keywords/Search Tags:

Partial Least Squares, Spam Filtering, Feature Extraction, Cross-validity

Related items

1	Spam Filtering Based On Kernel Partial Least Squares Feature Extraction
2	Based On Fuzzy Partial Least Squares Feature Extraction Methods
3	Research On Content-Based Spam Filtering Technology
4	Research On Spam Filtering Based On Social Network
5	Research On Multi-layered Content-Based SPAM Filtering System
6	Research On Content-Based Spam Filtering
7	Research Of Partial Least Squares Classification Algorithm Based On SLT And Its Optimization Method
8	Chinese Spam Filtering Based On Cross Cover Algorithm
9	Application Of Bayesian Classification In Spam SMS Filtering
10	Research On Chinese Spam Filtering Method