The Study On Time Series Feature Based Bayesian Spam Filtering

Posted on: 2013-09-27
Degree: Master
Type: Thesis
Country: China
Candidate: C L Shang
Full Text: PDF
GTID: 2248330395475446
Subject: Computer technology
Abstract/Summary:
With the development of the Internet, e-mail has rapidly become an effective channel for business and communication. At the same time, it has created opportunities for lawbreakers: large volumes of spam result in a tremendous waste of resources. Among the many anti-spam techniques, the Bayesian algorithm has become the mainstream choice and is widely used because of its better classification performance. Spam is time-sequence data: as time goes on, the characteristics of spam change. Based on the streaming nature of spam, this thesis studies spam filtering methods based on time series features. The main contents are as follows:

(1) Feature representation. We introduce an improved SimHash fingerprint method that generates similar signatures for similar content. Compared with the conventional N-gram method, a SimHash-based fingerprint can capture the diversity of the content as well as its basic information.

(2) Feature selection. Based on an analysis of the differences and shortcomings of common feature selection methods, we propose the MIC feature selection algorithm. It takes into account not only the degree of correlation between features and categories, but also how well each feature characterizes its class, which reflects the category distribution very well.

(3) Time series feature based spam filtering algorithm. In this method, feature weights adjust adaptively over time, so that the weight of a frequently used feature increases and the weight of a rarely used feature decreases. In offline mode, we also take into account emerging spam features, which represent the latest trend, and increase their weight to improve classification accuracy.

Finally, we integrate the improvements from each stage to design a time series Bayesian filter and test it on the latest standard data set, SEWM2012. Experimental results show that it outperforms Structfilter.
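The abstract does not give the weighting formula used in item (3), so the following Python sketch is only one plausible reading of it. It assumes exponential decay of token counts (a token that stops appearing gradually loses weight, one that keeps appearing regains it) and treats a token that first appears in spam as "emerging", giving it a larger initial count. The class name `TimeDecayedNB` and the parameters `half_life`, `alpha`, and `emerging_boost` are hypothetical, and class priors are assumed equal.

```python
import math


class TimeDecayedNB:
    """Naive Bayes sketch whose token counts decay exponentially over time.

    Assumptions (not from the thesis): exponential decay, equal class
    priors, and a fixed boost for tokens first seen in spam.
    """

    def __init__(self, half_life=100.0, alpha=1.0, emerging_boost=2.0):
        self.rate = math.log(2.0) / half_life  # decay rate per time unit
        self.alpha = alpha                     # Laplace smoothing constant
        self.emerging_boost = emerging_boost   # initial count for new spam tokens
        # label -> token -> (decayed_count, time_of_last_update)
        self.tokens = {"spam": {}, "ham": {}}
        self.vocab = set()

    def _decayed(self, entry, now):
        """Return the stored count discounted to time `now`."""
        count, last = entry
        return count * math.exp(-self.rate * (now - last))

    def train(self, tokens, label, now):
        """Add one message (a list of tokens) with the given label at time `now`."""
        table = self.tokens[label]
        for tok in set(tokens):
            # Hypothetical rule: an unseen token arriving in spam is "emerging".
            emerging = label == "spam" and tok not in self.vocab
            step = self.emerging_boost if emerging else 1.0
            old = self._decayed(table[tok], now) if tok in table else 0.0
            table[tok] = (old + step, now)
            self.vocab.add(tok)

    def spam_log_odds(self, tokens, now):
        """Return log P(tokens | spam) - log P(tokens | ham) at time `now`."""
        totals = {
            label: sum(self._decayed(e, now) for e in table.values())
            for label, table in self.tokens.items()
        }
        vocab_size = max(len(self.vocab), 1)
        score = 0.0
        for tok in set(tokens):
            prob = {}
            for label in ("spam", "ham"):
                entry = self.tokens[label].get(tok)
                count = self._decayed(entry, now) if entry else 0.0
                prob[label] = (count + self.alpha) / (
                    totals[label] + self.alpha * vocab_size
                )
            score += math.log(prob["spam"] / prob["ham"])
        return score


if __name__ == "__main__":
    nb = TimeDecayedNB(half_life=10.0)
    nb.train(["cheap", "offer"], "spam", now=0)
    nb.train(["project", "meeting"], "ham", now=0)
    # The same token carries less evidence once its counts have decayed.
    print(nb.spam_log_odds(["offer"], now=0))
    print(nb.spam_log_odds(["offer"], now=100))
```

A positive score favors spam. Storing the last update time with each count lets the decay be applied lazily, so training one message touches only the tokens in that message.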
Keywords/Search Tags: Spam filtering, Bayesian, fingerprint, feature selection, time series feature
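For readers unfamiliar with the fingerprinting mentioned in item (1) of the abstract, the sketch below shows plain SimHash in Python, not the improved variant proposed in the thesis, whose details the abstract does not give. The choice of word-level shingles as features, MD5 as the underlying hash, and the function names `simhash` and `hamming` are assumptions made for illustration.

```python
import hashlib


def simhash(text, bits=64, shingle=2):
    """Return a `bits`-wide SimHash fingerprint of `text`.

    `bits` must be a multiple of 8 and at most 128, because the feature
    hash is taken from the 16-byte MD5 digest.
    """
    words = text.lower().split()
    # Features are overlapping word shingles; short texts yield one feature.
    n_features = max(len(words) - shingle + 1, 1)
    features = [" ".join(words[i:i + shingle]) for i in range(n_features)]

    votes = [0] * bits
    for feat in features:
        digest = hashlib.md5(feat.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        # Each feature votes +1 or -1 on every bit position.
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1

    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Return the number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    a = simhash("win a free prize now click the link below to claim it")
    b = simhash("win a free prize today click the link below to claim it")
    c = simhash("minutes from the quarterly planning meeting are attached")
    print(hamming(a, b), hamming(a, c))
```

Because each bit is decided by a majority vote over all features, changing a few words usually flips only a few bits, so near-duplicate messages tend to have a small Hamming distance while unrelated messages tend to differ in about half of the bits.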