Research On Content-based Spam Filtering

Posted on: 2011-10-26; Degree: Master; Type: Thesis
Country: China; Candidate: X Sun
GTID: 2178360308454380; Subject: Communication and Information System
Abstract/Summary:
With the spread of e-mail, spam has become an increasingly serious problem in daily life. Spam variants evolve quickly and spam volume keeps growing, and both trends degrade filtering quality. Spam propagates rapidly and can cause real damage, and it has become a global form of social pollution, so it must be handled effectively. Existing spam-filtering techniques include rule-based filtering and content-based filtering. Rule-based filtering depends heavily on subjective judgment: its resistance to interference is weak, and the way the rules are written directly determines filtering quality. Content-based filtering is one of the main techniques used in current spam handling; it derives filter rules automatically using text categorization algorithms. This thesis studies the content-based approach.

The thesis focuses on improving filter performance through e-mail pre-processing, feature selection, and feature weight calculation. It analyzes the problems of existing spam-filtering algorithms and proposes corrective measures. To address the "curse of dimensionality" in content-based spam filtering, a double feature selection method is proposed that combines word frequency with other feature selection algorithms. It effectively reduces the impact of redundant information and noisy data on classification performance.

Because spam and legitimate e-mail differ, a Categories-LDA model is introduced: a separate model is generated for each category of mail, and the information that characterizes each topic is extracted. Categories-LDA avoids the performance degradation that traditional LDA suffers when it ignores the difference between spam and legitimate e-mail.

People disagree about what counts as spam, and the types of spam change over time. The thesis therefore presents a method based on a Feedback-Random Forest algorithm, which combines the advantages of decision trees and relevance feedback. The method can promptly capture changing trends in spam, establishes a link between the user and the filtering system, and lets the filtering system regulate itself. Experimental results show that Categories-LDA combined with the Feedback-Random Forest algorithm improves the performance of the e-mail filtering system more effectively. System accuracy improves by 2% on the 2005-Jun subset of the CCERT corpus, and spam precision improves by 3% on the TREC06 corpus.
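The double feature selection idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract only says that word frequency is combined with "other algorithms"; the chi-square statistic used as the second stage here is an assumption chosen for illustration, not necessarily the thesis's method. All function names, thresholds, and the toy corpus are hypothetical.

```python
from collections import Counter


def word_frequency_filter(docs, min_count=2):
    """Stage 1: keep terms that occur at least min_count times in the corpus."""
    counts = Counter(token for tokens in docs for token in tokens)
    return {term for term, n in counts.items() if n >= min_count}


def chi_square(doc_sets, labels, term, positive="spam"):
    """Chi-square statistic between term presence and the positive class."""
    a = b = c = d = 0
    for tokens, label in zip(doc_sets, labels):
        has_term = term in tokens
        is_pos = label == positive
        if has_term and is_pos:
            a += 1  # term present, spam
        elif has_term:
            b += 1  # term present, legitimate
        elif is_pos:
            c += 1  # term absent, spam
        else:
            d += 1  # term absent, legitimate
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def double_feature_selection(docs, labels, min_count=2, top_k=3):
    """Stage 1 prunes rare words; stage 2 ranks the survivors by chi-square.

    The second-stage criterion is an assumed stand-in for the unspecified
    "other algorithms" in the abstract.
    """
    candidates = word_frequency_filter(docs, min_count)
    doc_sets = [set(tokens) for tokens in docs]
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(
        candidates, key=lambda t: (-chi_square(doc_sets, labels, t), t)
    )
    return ranked[:top_k]


# Toy corpus (hypothetical data, already tokenized).
docs = [
    ["free", "offer", "click", "now"],
    ["free", "prize", "click"],
    ["meeting", "schedule", "now"],
    ["project", "meeting", "report"],
]
labels = ["spam", "spam", "ham", "ham"]

if __name__ == "__main__":
    print(double_feature_selection(docs, labels, min_count=2, top_k=3))
```

In this sketch the frequency stage discards words that appear only once, and the chi-square stage then drops words such as "now" that occur equally often in both classes, which is one way redundant or noisy features could be reduced before classification.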
Keywords/Search Tags: Spam filtering, Feedback, Feature selection, Feature weight calculation, LDA model