Research On Content-based Spam Filtering

Posted on:2011-10-26

Degree:Master

Type:Thesis

Country:China

Candidate:X Sun

Full Text:PDF

GTID:2178360308454380

Subject:Communication and Information System

Abstract/Summary:

With the growth of e-mail, spam has become an increasingly serious problem in daily life. The rapid diversification of spam and the sharp increase in its volume degrade the quality of spam filtering. Spam spreads quickly and can cause real damage, and it has become a form of pollution with global social impact, so it must be handled effectively. Available spam-filtering technology includes rule-based filtering and content-based filtering. Rule-based filtering depends heavily on subjective factors, so it is less robust to interference, and the way the rules are formulated directly affects filtering performance. Content-based filtering is one of the main technologies adopted in current spam processing; it derives filter rules automatically using text categorization algorithms. This paper studies the content-based filtering method.

This paper focuses on improving the filter performance of the system through e-mail pre-processing, feature selection, and feature weight calculation. It analyzes the problems of existing spam filtering algorithms and proposes corrective measures. For the "curse of dimensionality" problem in content-based spam filtering, a double feature selection method based on word frequency combined with other algorithms is proposed. It effectively reduces the impact of redundant information and noisy data on classification performance.

Because spam and legitimate e-mail differ, a Categories-LDA model is introduced: a separate model is generated for each category of mail, and the information that characterizes each topic is sought within it. The Categories-LDA model avoids the performance degradation of traditional LDA, which ignores the difference between spam and legitimate e-mail.

People do not agree on what counts as spam, and the types of spam change over time. The paper therefore presents a method based on the Feedback-Random Forest algorithm, which combines the advantages of decision trees and relevance feedback. The method promptly captures changing trends in spam, establishes a link between the user and the filtering system, and lets the mail filtering system regulate itself. Experimental results show that Categories-LDA combined with the Feedback-Random Forest algorithm improves the performance of the e-mail filtering system more effectively. System accuracy improves by 2% on the 2005-Jun subset of the CCERT corpus, and spam precision improves by 3% on the TREC06 corpus.
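The double feature selection idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract only says that word frequency is "combined with other algorithms", so the second stage here, a chi-square ranking, is an assumption chosen for illustration and not necessarily the thesis's method. The function names, the `min_df` and `top_k` parameters, and the toy messages are likewise hypothetical.

```python
from collections import Counter


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/class contingency table.

    n11: spam docs containing the term, n10: ham docs containing it,
    n01: spam docs without it,          n00: ham docs without it.
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def double_feature_selection(docs, labels, min_df=2, top_k=3):
    """Two-stage selection: frequency filter, then chi-square ranking.

    docs is a list of token lists; labels holds "spam" or "ham" per doc.
    The chi-square stage is an assumed stand-in for the unspecified
    "other algorithms" in the abstract.
    """
    # Stage 1: count document frequency and drop rare terms.
    df = Counter()
    spam_df = Counter()
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            if label == "spam":
                spam_df[term] += 1
    candidates = [term for term, count in df.items() if count >= min_df]

    # Stage 2: score surviving terms by their association with the spam class.
    n_spam = sum(1 for label in labels if label == "spam")
    n_ham = len(docs) - n_spam
    scores = {}
    for term in candidates:
        n11 = spam_df[term]
        n10 = df[term] - n11
        n01 = n_spam - n11
        n00 = n_ham - n10
        scores[term] = chi_square(n11, n10, n01, n00)

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(candidates, key=lambda term: (-scores[term], term))
    return ranked[:top_k]


# Toy, made-up messages used only to exercise the sketch.
docs = [
    ["free", "money", "now"],
    ["free", "offer", "now"],
    ["meeting", "tomorrow", "now"],
    ["project", "meeting", "notes"],
]
labels = ["spam", "spam", "ham", "ham"]
selected = double_feature_selection(docs, labels, min_df=2, top_k=2)
```

In this toy example the first stage discards terms that occur in only one message, and the second stage prefers terms that occur in one class only over a term such as "now" that appears in both, which is the kind of redundancy reduction the abstract describes.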

Keywords/Search Tags:

Spam-filtering, Feedback, Feature selection, Feature weight calculation, LDA model

Related items

1	Study On Spam Filtering Technology Based On IMI-WNB Algorithm
2	Research And Implementation Of Spam Filtering Technology Based On AAPE Classification Model
3	Spam Filtering Techniques, Based On Data Mining
4	A Research Of Spam Filtering Based On Text Mining
5	Research On Feature Selection Algorithm Of Spam Filtering
6	Spam Filtering Technology Research Based On Statistical Model
7	Research On Online Learning Based Spam Filtering
8	Research On Content-Based Spam Filtering Technology
9	Research On Content-Based Spam Filtering Technology
10	Research On Content-Based Spam Filtering Technology