Spam Filtering Based On Partial Least Squares

Posted on: 2009-07-11
Degree: Master
Type: Thesis
Country: China
Candidate: P M Wang
GTID: 2178360272480745
Subject: Computer system architecture
Abstract/Summary:
The problem of unsolicited bulk e-mail, or spam, grows worse every year, which has underscored the need for automatic spam filters. Recent spam filtering research has focused on machine learning for the automatic creation of personalized spam filters; in other words, spam filtering is treated as a branch of text classification. A filter built this way has the advantage of being optimized for the e-mail distribution of the individual user.

Many machine learning algorithms have been applied to generate spam filters. However, because each message contains only a small fraction of the vocabulary, these datasets exhibit high word dimensionality and severe data sparseness. Moreover, since different messages contain many synonyms and similar content, the severe multicollinearity among word features must be taken into account. To deal with these problems, this thesis presents a new feature extraction method based on Partial Least Squares (PLS).

The idea of the method is first to analyze the relationship between a message's original features and its class label. To maximize the covariance between them, linear combinations of the original features are formed repeatedly, extracting a much smaller number of new components; the subspace spanned by these components addresses the problems stated above. Finally, a cross-validity algorithm is introduced to determine the dimension of the extracted subspace.

Experiments on the CEAS 2006 benchmark datasets (Enron-Spam datasets), evaluated following the TREC Spam Track, give promising results, and the new method performs better than feature selection by the χ² statistic.

The main contributions of this thesis are:
(1) A PLS feature extraction model for spam filtering is proposed; it effectively addresses problems common in mail datasets: high word dimensionality, severe data sparseness, and multicollinearity.
(2) To improve filter efficiency, a cross-validity algorithm is introduced to determine the dimension of the extracted subspace.
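
The abstract does not give the algorithm in detail, so the following is only a minimal sketch of the general idea, not the thesis's implementation. It assumes a single 0/1 class label (PLS1), NIPALS-style deflation, and plain k-fold cross-validation that minimizes the prediction error sum of squares (PRESS) as a stand-in for the cross-validity criterion. The function names (`pls1_fit`, `pls1_transform`, `pls1_predict`, `choose_n_components`) and the synthetic word-count data are hypothetical and chosen for illustration.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit PLS1: each component maximizes covariance between X scores and y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W = np.zeros((X.shape[1], n_components))  # weight vectors
    P = np.zeros((X.shape[1], n_components))  # X loadings
    q = np.zeros(n_components)                # y loadings
    for k in range(n_components):
        w = Xk.T @ yk                         # direction of maximal covariance with y
        norm = np.linalg.norm(w)
        if norm < 1e-12:                      # nothing left to explain: truncate
            W, P, q = W[:, :k], P[:, :k], q[:k]
            break
        w /= norm
        t = Xk @ w                            # scores of the new component
        tt = t @ t
        p = Xk.T @ t / tt
        q[k] = yk @ t / tt
        Xk = Xk - np.outer(t, p)              # deflate X and y before the next component
        yk = yk - q[k] * t
        W[:, k], P[:, k] = w, p
    return x_mean, y_mean, W, P, q

def pls1_transform(model, X):
    """Project messages onto the extracted low-dimensional subspace."""
    x_mean, _, W, P, _ = model
    R = W @ np.linalg.inv(P.T @ W)            # rotation from original features to scores
    return (np.asarray(X, dtype=float) - x_mean) @ R

def pls1_predict(model, X):
    """Regression estimate of the label; threshold at 0.5 to classify."""
    _, y_mean, _, _, q = model
    return y_mean + pls1_transform(model, X) @ q

def choose_n_components(X, y, max_components, n_folds=5, seed=0):
    """Pick the subspace dimension with the smallest cross-validated PRESS."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    press = np.zeros(max_components)
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(y)), held_out)
        for k in range(1, max_components + 1):
            model = pls1_fit(X[train], y[train], k)
            residual = y[held_out] - pls1_predict(model, X[held_out])
            press[k - 1] += np.sum(residual ** 2)
    return int(np.argmin(press)) + 1, press

# Synthetic stand-in for word counts: 10 "spam" words, 10 "ham" words, 20 noise words.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200).astype(float)   # 1 = spam, 0 = legitimate
rates = np.full((200, 40), 0.3)
rates[:, :10] += 3.0 * y[:, None]
rates[:, 10:20] += 3.0 * (1.0 - y[:, None])
X = rng.poisson(rates).astype(float)

k_best, press = choose_n_components(X, y, max_components=5)
model = pls1_fit(X, y, k_best)
T = pls1_transform(model, X)                     # reduced features, shape (200, k_best)
accuracy = np.mean((pls1_predict(model, X) > 0.5) == (y == 1))
print(k_best, T.shape, accuracy)
```

In this sketch the reduced matrix `T` would replace the original word-count matrix as input to a downstream classifier. Because the components are built from covariance with the label, correlated word features collapse into shared components rather than being selected or discarded one by one, which is the contrast with χ²-based feature selection drawn in the abstract.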
Keywords/Search Tags: Partial Least Squares, Spam Filtering, Feature Extraction, Cross-validity