Spam Filtering Technology Research Based On Statistical Model

Posted on: 2008-04-11
Degree: Master
Type: Thesis
Country: China
Candidate: T Wang
Full Text: PDF
GTID: 2178360215999607
Subject: Computer application technology
Abstract/Summary:
Electronic mail (e-mail) is becoming one of the fastest and most economical means of communication available. Although valued for its convenience, e-mail is abused by some people, and junk mail (also referred to as "spam") often invades e-mail systems, causing serious harm to e-mail users and ISPs. Spam filtering is therefore a problem in urgent need of solutions.

Today, the main anti-spam technologies are either source-based or content-based. By bringing data mining and machine learning theory into spam filtering, content-based technologies can be further divided into rule-based and statistics-based ones. Among the many statistics-based methods, Naive Bayes (NB) is a simple machine learning method. It learns the features that distinguish spam from ham (normal, legitimate e-mail) and constructs a statistical model from them. When a new e-mail arrives, the classifier uses the statistics learned during training to estimate the probability that the e-mail is spam or ham, and assigns it to the class with the higher probability. NB is widely used in spam filtering because it is fast and simple to implement.

Naive Bayes spam filtering (NBF) models face several key problems, such as the formal representation of mail text, feature selection, and the probability computing model. Building on an analysis of the classic method, this thesis studies these problems in depth, effectively improves the model, and makes some bold attempts.

The main contents of this thesis are as follows:

1) A summary of the state of spam filtering, including the definition, harm, and characteristics of spam and the commonly used anti-spam technologies.
2) An introduction to, and detailed analysis of, commonly used e-mail corpora and evaluation criteria.
3) A detailed analysis of the implementation and key problems of the classic NBF models.
4) A combination of feature selection methods from text categorization with NBF models, with a comprehensive analysis of the characteristics of each method. Tests on the Ling-Spam corpus found that the CHI method made NBF more effective.
5) To address the limitations of the ECE feature selection method, the thesis proposes an advanced ECE method (AECE).
6) After a comprehensive analysis of the common statistical computing models of NBF, the thesis selects the best model by experiment. Based on this choice of computing model and on feature selection improved with weighted features, the thesis proposes Advanced Naive Bayes (A-NBF).
7) After a comprehensive analysis of the characteristics of classic NBF based on Risk Minimization, and with the definition of a new risk factor, the thesis puts forward a new Naive Bayes filtering model based on Line Geometry Division (LGDNBF). The new risk factor describes the risk of decision-making more precisely. The test results demonstrate that LGDNBF has better performance.
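
The classic NB filtering step described in the abstract (learn word statistics for spam and ham, then assign a new e-mail to the more probable class) can be illustrated with a minimal sketch. It is not the thesis's A-NBF or LGDNBF model. The multinomial word model, Laplace smoothing, whitespace tokenization, and the `threshold` parameter (a common cost-sensitive variant that demands stronger evidence before labelling a message as spam) are all assumptions chosen for illustration, and the class and function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; real filters use richer features.
    return text.lower().split()


class NaiveBayesFilter:
    """Illustrative multinomial Naive Bayes filter (not the thesis's model)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant (assumed value)
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}
        self.vocab = set()

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def log_score(self, text, label):
        # log P(label) + sum of log P(token | label).
        # Requires at least one training e-mail per class.
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        denom = total_words + self.alpha * len(self.vocab)
        for tok in tokenize(text):
            if tok in self.vocab:  # tokens never seen in training are ignored
                count = self.word_counts[label][tok]
                score += math.log((count + self.alpha) / denom)
        return score

    def classify(self, text, threshold=1.0):
        # Label as spam only if P(spam | text) / P(ham | text) > threshold.
        # threshold=1.0 reproduces the "bigger probability wins" rule.
        log_ratio = self.log_score(text, "spam") - self.log_score(text, "ham")
        return "spam" if log_ratio > math.log(threshold) else "ham"


if __name__ == "__main__":
    nb = NaiveBayesFilter()
    nb.train("cheap pills buy now", "spam")
    nb.train("buy cheap watches now", "spam")
    nb.train("meeting agenda for monday", "ham")
    nb.train("lunch on monday with the team", "ham")
    print(nb.classify("buy cheap pills"))
    print(nb.classify("monday meeting agenda"))
```

Raising `threshold` above 1 trades missed spam for fewer legitimate e-mails wrongly blocked, which is the kind of decision risk that the risk-minimization discussion in item 7 is concerned with.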
Keywords/Search Tags: Spam Filtering, Naive Bayes, Risk Minimization, Risk Factor, Feature Selection