Spam Classification Based On The Semantic Set Model And Finite Automata

Posted on:2009-01-02

Degree:Master

Type:Thesis

Country:China

Candidate:E Q Zhou

Full Text:PDF

GTID:2208360245461355

Subject:Information security

Abstract/Summary:

According to a 2007 survey by the Anti-Spam Center of ISC, China still ranks third among countries by total spam volume, which makes anti-spam research increasingly important. Today the mainstream anti-spam approach is content-based filtering, which falls into two categories: rule-based techniques, and techniques based on machine learning and statistics. First, we present a method that improves the performance of rule-based filtering systems by applying finite automata theory. Second, because the Vector Space Model that is widely used in machine learning and statistical techniques cannot exploit the rich semantic information in mail, we propose and implement a novel Semantic Set Model (SSModel), together with a high-quality algorithm built on this model, for Chinese spam filtering. We give a theoretical analysis and an empirical evaluation on the CCERT data sets.

For the rule-based approach, most current systems, such as SpamAssassin, use Perl regular expressions as their matching mechanism. However, the performance of such systems can decline drastically as the rule set grows. Our automata-based method handles this problem more satisfactorily, and it can also be used to build a more flexible system.

SSModel has two strengths. First, it mines the relationships among terms in raw mail and extracts them as semantic features. Because its purpose is to capture semantic information in natural language, the model offers a good foundation for subsequent processing. Second, acquiring genuine (non-spam) messages for public use has always been a major challenge because of privacy issues. In this thesis we therefore build a "spam class" on SSModel using only spam samples, and present a novel classification algorithm accordingly. To the best of our knowledge, ours is the first spam filtering system built solely on spam samples.

Our experiments confirm that the SSModel-based algorithm outperforms previous approaches. We evaluate the system using spam precision (SP), spam recall (SR), and total cost ratio (TCR). With distance = 30 and threshold = 5, we obtain SP = 97.51% and SR = 93.34%, with TCR = 11.05 when λ = 1 and TCR = 3.55 when λ = 9.
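
The three evaluation measures are linked, so the reported TCR values can be checked from SP and SR alone. The sketch below assumes the definition of TCR that is common in the spam-filtering literature, TCR = n_spam / (λ · n_legit_as_spam + n_spam_as_legit), where λ is the cost weight that the abstract sets to 1 and 9. The abstract does not state its formula, so this definition, the function name `total_cost_ratio`, and its parameter names are assumptions made for illustration.

```python
def total_cost_ratio(sp: float, sr: float, lam: float) -> float:
    """Total cost ratio from spam precision, spam recall and cost weight.

    Assumed definition (not stated in the abstract):
        TCR = n_spam / (lam * n_legit_as_spam + n_spam_as_legit)
    Dividing through by n_spam lets both error counts be written
    in terms of SP and SR only.
    """
    if not (0.0 < sp <= 1.0 and 0.0 <= sr <= 1.0):
        raise ValueError("sp must be in (0, 1] and sr in [0, 1]")
    if lam <= 0:
        raise ValueError("lam must be positive")

    # Legitimate messages wrongly flagged, per spam message:
    # caught spam = SR * n_spam, and SP = caught / (caught + false positives).
    false_positives = sr * (1.0 - sp) / sp
    # Spam messages that slipped through, per spam message.
    false_negatives = 1.0 - sr

    cost = lam * false_positives + false_negatives
    # A filter with no errors has zero cost, hence an unbounded ratio.
    return float("inf") if cost == 0 else 1.0 / cost


if __name__ == "__main__":
    for lam in (1, 9):
        print(lam, total_cost_ratio(sp=0.9751, sr=0.9334, lam=lam))
```

Under this assumed definition, SP = 97.51% and SR = 93.34% give roughly 11.06 for a weight of 1 and roughly 3.56 for a weight of 9, which agrees with the reported 11.05 and 3.55 to within the rounding of the two percentages. A TCR above 1 means the filter costs less than using no filter at all.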

Keywords/Search Tags:

Semantic Set Model, spam class, rule filtering, automata machine, Support Vector Machine

Related items

1	The Research Of Spam E-mail Filtering Technology
2	Spam Filter Based On Support Vector Machine Theory Model
3	SVM-Based Novel Method Of Online Spam Filtering
4	Research On Spam Filtering Of Particle Swarm Optimized SVM
5	Research On Text Classification Filtering Technology Based On Latent Semantic Indexing And Support Vector Machine
6	Application Research Of Image Spam Filtering Based On Wavelet And Support Vector Machine
7	Research On Spam-filtering Method Based On Visual Features Analysis
8	Design And Implementation Of The Email Spam Detection System Based On Naive Bayes And Svm
9	Design And Implementation Of The Email Spam Detection System Based On Naive Bayes And SVM
10	Research On Spam Filter Model Based On Support Vector Machine