Classification Of Spam Based On The Semantics Of A Collection Of Models And Finite Automata

Posted on: 2009-01-02 | Degree: Master | Type: Thesis
Country: China | Candidate: E Q Zhou
GTID: 2208360245461355 | Subject: Information security
Abstract/Summary:
According to a 2007 survey by the Anti-Spam Center of the ISC, China remains the third-largest country in terms of total spam volume, so anti-spam research is becoming more important. Today, the mainstream anti-spam approach is content-based spam filtering, which falls into two categories: rule-based techniques and techniques based on machine learning and statistics. First, we present a method that improves the performance of rule-based filtering systems by applying automata theory. Second, because of the limitations of the Vector Space Model, which is widely used in machine learning and statistics based techniques, the rich semantic information in mail cannot be exploited for spam classification. We therefore propose and implement a novel Semantic Set Model (SSModel), together with a high-quality algorithm based on this model, for Chinese spam filtering. We present a theoretical analysis and an empirical evaluation on the CCERT data sets.

For the rule-based approach, most current systems, such as SpamAssassin, use Perl regular expressions as their matching mechanism. However, the performance of such systems can decline drastically as the rule set grows. We propose a method that improves the performance of rule-based filtering systems by applying automata theory. Our solution addresses this problem more satisfactorily and can also be used to build a more flexible system.

SSModel has two strengths. First, we mine the relationships among terms in raw mail and extract these semantic features into SSModel. The model provides a good foundation for subsequent processing because its purpose is to capture semantic information in natural language.

Second, acquiring genuine messages for public use is a persistent challenge because of privacy issues.
In this thesis, we build a "spam class" based on SSModel using only spam samples, and present a novel classification algorithm accordingly. To the best of our knowledge, ours is the first spam filtering system built solely on spam samples. Our experiments confirm that the SSModel-based algorithm outperforms previous approaches. SP (spam precision), SR (spam recall), and TCR (total cost ratio) are used to evaluate our system. With distance = 30 and threshold = 5, we obtain SP = 97.51%, SR = 93.34%, and TCR = 11.05 and 3.55 when λ is set to 1 and 9, respectively.
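
The abstract does not state the TCR formula. Assuming the definition commonly used in the spam-filtering literature, TCR = N_S / (λ · n_L→S + n_S→L), where N_S is the number of spam messages, n_L→S the number of legitimate messages marked as spam, n_S→L the number of spam messages missed, and λ the cost weight (the parameter set to 1 and 9 above), TCR can be rewritten using SP and SR alone. The sketch below is an illustration under that assumption, and the function name `total_cost_ratio` is hypothetical.

```python
def total_cost_ratio(sp: float, sr: float, lam: float) -> float:
    """TCR from spam precision (sp), spam recall (sr) and cost weight (lam).

    Assumes TCR = N_S / (lam * n_L->S + n_S->L); all counts are expressed
    per spam message, so N_S cancels out.
    """
    # Spam correctly caught per spam message is sr. Since
    # sp = caught / (caught + false_positives), we get:
    false_positives = sr * (1.0 / sp - 1.0)  # legitimate mail marked as spam
    false_negatives = 1.0 - sr               # spam that slipped through
    return 1.0 / (lam * false_positives + false_negatives)


for lam in (1, 9):
    print(lam, round(total_cost_ratio(0.9751, 0.9334, lam), 2))
```

With the reported SP and SR this gives about 11.06 and 3.56, which agrees with the reported 11.05 and 3.55 up to rounding of the published percentages.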
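
The rule-based contribution can be illustrated with a small sketch of why an automaton scales better than testing rules one by one. The abstract does not say which automaton construction the thesis uses, so this sketch assumes the well-known Aho-Corasick algorithm restricted to plain keyword rules; it is not the author's implementation. The names `build_automaton` and `find_rules` and the sample rules are hypothetical.

```python
from collections import deque


def build_automaton(patterns):
    """Build an Aho-Corasick automaton (trie, failure links, outputs)."""
    goto = [{}]     # goto[state][char] -> next state
    fail = [0]      # failure link of each state
    out = [set()]   # patterns recognised on entering each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(pat)
    # Breadth-first traversal: failure links of shallower states come first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]  # inherit matches ending at the suffix
    return goto, fail, out


def find_rules(text, automaton):
    """Return (start_index, pattern) for every rule hit, in one pass over text."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits


rules = ["free", "free money", "viagra", "money"]  # hypothetical keyword rules
automaton = build_automaton(rules)
print(sorted(find_rules("get free money now", automaton)))
# prints [(4, 'free'), (4, 'free money'), (9, 'money')]
```

The message is scanned once regardless of how many keyword rules are loaded, whereas applying each regular expression separately costs one scan per rule. Full regular-expression rules would need a more general automaton construction than this keyword-only sketch.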
Keywords/Search Tags: Semantic Set Model, spam class, rule filtering, automata machine, Support Vector Machine