Spam Filtering Technology Research Based On Statistical Model

Posted on:2008-04-11

Degree:Master

Type:Thesis

Country:China

Candidate:T Wang

GTID:2178360215999607

Subject:Computer application technology

Abstract/Summary:

Electronic mail (e-mail) has become one of the fastest and most economical means of communication. Although valued for its convenience, e-mail is also abused: junk mail (also called "spam") often floods e-mail systems and causes considerable harm to e-mail users and ISPs. Effective anti-spam measures are therefore urgently needed.

Current anti-spam technologies are mainly source-based or content-based. With the application of data mining and machine learning to spam filtering, content-based technologies can be further divided into rule-based and statistics-based ones. Among the statistics-based methods, Naive Bayes (NB) is a simple machine learning method. It learns the features that distinguish spam from ham (normal, legitimate e-mail) and builds a statistical model from them. When a new e-mail arrives, the classifier uses the statistics learned during training to estimate the probability that the e-mail is spam or ham, and assigns it to the class with the higher probability. NB is widely used in spam filtering because it is fast and simple to implement.

Naive Bayes spam filtering (NBF) models face several key problems, including the formal representation of mail text, feature selection, and the probability computing model. Building on an analysis of the classic method, this thesis studies these problems in depth, improves the model, and explores several new approaches.

The main contents of this thesis are as follows:
1) A survey of the state of spam filtering, including the definition, harm, and characteristics of spam and the commonly used anti-spam technologies.
2) An introduction to, and detailed analysis of, commonly used e-mail corpora and evaluation criteria.
3) A detailed analysis of the implementation and key problems of the classic NBF models.
4) Feature selection methods from text categorization are combined with NBF models, and the characteristics of each method are analyzed. Tests on the Ling-Spam corpus found that the CHI method made NBF more effective.
5) To address the limitations of the ECE feature selection method, the thesis proposes an advanced ECE method (AECE).
6) After a comprehensive analysis of the common statistical computing models of NBF, the thesis selects the optimal model experimentally. Based on this choice of computing model and on feature selection improved with weighted features, the thesis proposes Advanced Naive Bayes (A-NBF).
7) After a comprehensive analysis of the characteristics of classic NBF based on Risk Minimization, and with the definition of a new risk factor, the thesis puts forward a new Naive Bayes filtering model based on Line Geometry Division (LGDNBF). The new risk factor describes the risk of decision-making more precisely. The test results demonstrate that LGDNBF performs better.
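
For orientation, the classic pipeline that the abstract takes as its starting point can be sketched roughly as follows. This is a minimal illustration only, and it rests on several assumptions: CHI is taken to be the chi-square statistic, the probability model is a multinomial one with Laplace smoothing, and the risk-minimizing rule is the common cost-sensitive threshold with a cost ratio `lam`. The function names and the four-message toy corpus are hypothetical and do not come from the thesis. The sketch does not reproduce AECE, A-NBF, or LGDNBF, whose definitions the abstract does not give.

```python
import math
from collections import Counter

def tokenize(text):
    # Simplest possible mail representation: lower-cased whitespace tokens.
    return text.lower().split()

def chi_square(term, doc_sets, labels):
    # 2x2 contingency table of term presence against the spam/ham label.
    a = b = c = d = 0
    for tokens, label in zip(doc_sets, labels):
        present = term in tokens
        if label == "spam":
            a, c = a + present, c + (not present)
        else:
            b, d = b + present, d + (not present)
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def select_features(docs, labels, k):
    # Keep the k terms with the highest chi-square score (ties broken by name).
    doc_sets = [set(tokens) for tokens in docs]
    terms = set().union(*doc_sets)
    ranked = sorted(terms, key=lambda t: (-chi_square(t, doc_sets, labels), t))
    return set(ranked[:k])

def train(docs, labels, vocab):
    # Multinomial Naive Bayes with Laplace (add-one) smoothing, in log space.
    counts = {"spam": Counter(), "ham": Counter()}
    for tokens, label in zip(docs, labels):
        counts[label].update(t for t in tokens if t in vocab)
    n_docs = Counter(labels)
    model = {"vocab": vocab, "log_prior": {}, "log_lik": {}}
    for cls in ("spam", "ham"):
        total = sum(counts[cls].values())
        model["log_prior"][cls] = math.log(n_docs[cls] / len(labels))
        model["log_lik"][cls] = {
            t: math.log((counts[cls][t] + 1) / (total + len(vocab))) for t in vocab
        }
    return model

def spam_probability(model, tokens):
    # Posterior P(spam | tokens), normalized over the two classes.
    scores = {}
    for cls in ("spam", "ham"):
        scores[cls] = model["log_prior"][cls] + sum(
            model["log_lik"][cls][t] for t in tokens if t in model["vocab"]
        )
    top = max(scores.values())
    exp = {cls: math.exp(s - top) for cls, s in scores.items()}
    return exp["spam"] / (exp["spam"] + exp["ham"])

def classify(model, tokens, lam=1.0):
    # Cost-sensitive rule: blocking a ham message is treated as lam times
    # worse than passing a spam message, so spam needs P(spam) > lam / (1 + lam).
    return "spam" if spam_probability(model, tokens) > lam / (1 + lam) else "ham"

# Hypothetical toy corpus, for illustration only.
corpus = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project report attached for review", "ham"),
]
docs = [tokenize(text) for text, _ in corpus]
labels = [label for _, label in corpus]
vocab = select_features(docs, labels, k=3)
model = train(docs, labels, vocab)

print(classify(model, tokenize("free money now")))
print(classify(model, tokenize("free money now"), lam=9))
```

Raising `lam` makes the filter more conservative: a message is marked as spam only when the posterior clears a higher threshold, which trades missed spam for fewer blocked legitimate messages.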

Keywords/Search Tags:

Spam Filtering, Naive Bayes, Risk Minimization, Risk Factor, Feature Selection

Related items

1	The Study Of SPAM Filtering Method Based On Risk Minimization
2	The Study On Algorithm Of The Minimum Risk Bayes Spam Filtering
3	Spam Filtering Techniques, Based On Data Mining
4	Research On Spam Filtering Technologies Based On Content Characteristics Analysis
5	Research On Chinese Spam SMS Filtering Method Based On Rough Set And Naive Bayes
6	Research On SMS Filtering Technology On Intelligent Mobilephone
7	Research On Spam Filtering Technology Based On Bayesian Classification
8	Application Of Improved Naive Bayes Algorithm In Spam Filtering
9	Study On Spam Filtering Technology Based Bayes
10	Research On Spam Text Classification Based On Improved Naive Bayes Algorithm