Spam Filtering Techniques Based On Data Mining

Posted on:2010-12-14

Degree:Master

Type:Thesis

Country:China

Candidate:Z B Wang

Full Text:PDF

GTID:2208360278979051

Subject:Computer software and theory

Abstract/Summary:

Convenient, instant, and low-cost, e-mail has become the most popular means of communication and information exchange. However, spam now harasses users heavily. Spam, which typically carries commercial advertising, malicious programs, or unhealthy content, takes up network bandwidth, wastes users' time and money, and affects people's mood. Spam filtering is therefore a pressing problem, and spam filtering technologies are being developed and improved continuously.

Among the currently popular spam filtering technologies, Naive Bayes (NB) is a simple statistics-based machine learning method. It learns by extracting the features of spam and ham (normal, legitimate e-mail) from a training set and then constructing a statistical model (the classifier). The classifier estimates the probability that a newly arrived e-mail is spam or ham, and the e-mail is assigned to the class with the higher probability. NB is widely used in spam filtering because it is fast and simple to implement.

The key problems in Naive Bayes spam filtering (NBF) models are the formal representation of mails, feature selection, and the probability computing model. Based on an analysis of the classic method, this thesis studies these problems in depth and introduces the clustering method of text mining into the traditional filter model, which effectively improves it.

In spam filtering, people would rather receive more junk e-mail than have a normal e-mail wrongly classified as spam. White list technology has a shortcoming in this respect: normal e-mails from senders who are not on the white list may be blocked.
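The Naive Bayes procedure described above (learn word statistics from labelled spam and ham, then assign a new mail to the class with the higher probability) can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the whitespace tokenizer, the Laplace (add-one) smoothing, the toy training mails, and all function names are assumptions made for the example.

```python
import math
from collections import Counter

LABELS = ("ham", "spam")  # "ham" first, so exact ties resolve to ham


def tokenize(text):
    # Assumption: lowercase whitespace tokens stand in for real feature extraction.
    return text.lower().split()


def train(mails):
    """Build word and class counts from (text, label) pairs."""
    word_counts = {label: Counter() for label in LABELS}
    class_counts = Counter()
    for text, label in mails:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["ham"]) | set(word_counts["spam"])
    return word_counts, class_counts, vocab


def log_posteriors(text, model):
    """Return the unnormalized log-probability of each class for one mail."""
    word_counts, class_counts, vocab = model
    total_mails = sum(class_counts.values())
    scores = {}
    for label in LABELS:
        # Log prior of the class.
        score = math.log(class_counts[label] / total_mails)
        # Laplace-smoothed log likelihood of each known word.
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:
                score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return scores


def classify(text, model):
    """Assign the mail to the class with the higher posterior."""
    scores = log_posteriors(text, model)
    return max(scores, key=scores.get)


# Toy training data, invented for illustration only.
TRAINING = [
    ("cheap pills buy now", "spam"),
    ("win money now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project report attached", "ham"),
]
MODEL = train(TRAINING)
```

With this toy model, `classify("buy cheap pills", MODEL)` selects the spam class and `classify("monday project meeting", MODEL)` selects ham. Working in log space avoids numeric underflow when many word probabilities are multiplied.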
This thesis studies the issue thoroughly and proposes a two-tier spam filtering model based on white list technology and Bayesian classification.

The thesis first outlines the state of spam filtering research, including the definition of spam, its hazards, an analysis of its characteristics, and popular filtering techniques. It then analyses the key issues of the traditional NBF model in detail. Finally, to overcome the shortcomings of white list technology, reduce the number of features in the Bayesian filtering model, and improve on the traditional NBF model, it proposes a two-tier spam filtering model that applies feature clustering to reduce the feature dimension.
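One possible reading of the two-tier model is sketched below: mail from a white-listed sender is accepted directly, and all other mail is passed to a content classifier instead of being blocked outright. The abstract does not specify how the two tiers are combined, so this routing, the `threshold` value, and the stand-in scorer `toy_spam_probability` are assumptions for illustration only; in practice the second tier would be a trained Bayesian classifier.

```python
def two_tier_filter(sender, text, white_list, spam_probability, threshold=0.9):
    """Route a mail through a white list tier, then a content tier.

    `spam_probability` is any callable mapping mail text to a value in [0, 1].
    The high default threshold (an assumption) reflects the stated preference
    for letting some spam through over misclassifying a normal mail.
    """
    # Tier 1: trusted senders bypass content filtering.
    if sender.lower() in white_list:
        return "ham"
    # Tier 2: unknown senders are scored by content, not blocked outright.
    if spam_probability(text) >= threshold:
        return "spam"
    return "ham"


def toy_spam_probability(text):
    """Hypothetical stand-in for a trained Bayesian classifier."""
    spam_words = {"cheap", "pills", "win", "money"}
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    # Fraction of tokens that appear in the toy spam word list.
    return sum(token in spam_words for token in tokens) / len(tokens)


# Hypothetical white list for the example.
WHITE_LIST = {"alice@example.com"}
```

In this sketch a normal mail from an unlisted sender, such as `two_tier_filter("bob@example.com", "project report attached", WHITE_LIST, toy_spam_probability)`, is still delivered, which is the white list shortcoming the second tier is meant to address.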

Keywords/Search Tags:

Spam Filtering, Naive Bayes, Feature Selection, Feature Clustering, Similarity

Related items

1	Research On Spam Filtering Technologies Based On Content Characteristics Analysis
2	Research On Spam Filtering Technology Based On Bayesian Classification
3	Research On Chinese Spam SMS Filtering Method Based On Rough Set And Naive Bayes
4	Spam Filtering Technology Research Based On Statistical Model
5	Application Of Improved Naive Bayes Algorithm In Spam Filtering
6	Research On Spam Text Classification Based On Improved Naive Bayes Algorithm
7	Research On Content-Based Spam Filtering Technology
8	Research On Content-Based Spam Filtering Technology
9	The Study On Algorithm Of The Minimum Risk Bayes Spam Filtering
10	The Research And Application Of Text Categorization Arithmetic In Spam Filtering