Content-based Indexing Of Spam Filter Research And Implementation

Posted on: 2012-07-09
Degree: Master
Type: Thesis
Country: China
Candidate: Y Wang
Full Text: PDF
GTID: 2178330341950169
Subject: Computer application technology
Abstract/Summary:
In the information age, spam is regarded as one of the most effective and inexpensive forms of Internet advertising, and some speculators also use it to distribute illegal information. Spam greatly interferes with people's daily lives and consumes their time and energy. More seriously, it creates hidden information security risks, damages the market image of Internet Service Providers (ISPs), erodes their intangible assets, and poses a growing threat to network resources and network security.

This thesis mainly studies the Mail User Agent filter (the client-side filter). The author extends the spam filtering function of the Microsoft Outlook mail client so that users can define custom settings. Because spam is currently spread mainly in the form of letters, the thesis focuses on content-based spam filtering. The basic process can be divided into two phases: training and classification. Together, the two phases comprise five main steps: e-mail preprocessing, text representation, feature selection, classification prediction, and evaluation of filter quality. The thesis concentrates on feature selection and classification prediction, which are the core steps of a spam filter.

First, we analyze eight common feature selection methods: document frequency, information gain, mutual information, the CHI statistic, expected cross entropy, weight of evidence for text, odds ratio, and relevance score. Second, a closer study of the mutual information method shows that feature words appearing in only one class all receive the same mutual information value, so their importance cannot be distinguished. Starting from this observation, an improved mutual information method is proposed, which uses an adjusted TFIDF weight function to balance the importance of feature words.
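The mutual information problem described above can be illustrated with a small sketch. It assumes the common pointwise estimate MI(t, c) = log(P(t, c) / (P(t) P(c))) computed from document counts. The abstract does not give the exact formula, so the function name, the toy counts, and the estimate itself are illustrative assumptions. The TFIDF-based adjustment proposed in the thesis is not reproduced here, because the abstract does not specify it.

```python
import math


def pointwise_mi(term_in_class, term_total, class_docs, total_docs):
    """Estimate MI(t, c) = log(P(t, c) / (P(t) * P(c))) from document counts.

    term_in_class: documents of class c that contain term t
    term_total:    documents of any class that contain term t
    class_docs:    documents that belong to class c
    total_docs:    documents in the whole collection
    """
    # With P(t, c) = term_in_class / N, P(t) = term_total / N and
    # P(c) = class_docs / N, the ratio simplifies to the expression below.
    return math.log((term_in_class * total_docs) / (term_total * class_docs))


# Hypothetical collection: 100 messages, 40 of them spam.
# Both terms occur only in spam, so term_in_class == term_total.
rare = pointwise_mi(term_in_class=2, term_total=2, class_docs=40, total_docs=100)
common = pointwise_mi(term_in_class=30, term_total=30, class_docs=40, total_docs=100)

# The term counts cancel, leaving log(total_docs / class_docs) in both cases.
print(rare, common)
```

Under this estimate, a word seen in 2 spam messages and a word seen in 30 spam messages receive the same score, which is why an additional frequency-sensitive weight is needed to separate them.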
Finally, the thesis examines two classification algorithms: the Bayesian classifier and the support vector machine (SVM). The experiments use the standard Ling-Spam mail collection and compare the methods in four respects: feature selection method, feature dimension, classifier, and training set size, with the F1 value and the false ratio as evaluation measures. The results show that the improved mutual information method is more stable than the other methods. Lastly, using an external program for Microsoft Outlook, we implement a spam filter system that provides the required spam filtering functions.
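The two evaluation measures named above can be sketched as follows. The abstract does not define "false ratio", so this sketch assumes it means the share of legitimate messages wrongly marked as spam (the false positive rate). The function name and the label encoding (1 = spam, 0 = legitimate) are also assumptions made for illustration.

```python
def f1_and_false_ratio(y_true, y_pred, spam=1):
    """Return (F1 for the spam class, assumed false ratio) for two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == spam and p == spam)  # spam caught
    fn = sum(1 for t, p in pairs if t == spam and p != spam)  # spam missed
    fp = sum(1 for t, p in pairs if t != spam and p == spam)  # legitimate mail blocked
    tn = sum(1 for t, p in pairs if t != spam and p != spam)  # legitimate mail passed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # Assumption: "false ratio" = legitimate messages misclassified as spam.
    false_ratio = fp / (fp + tn) if (fp + tn) else 0.0
    return f1, false_ratio


# Hypothetical labels for eight messages.
truth = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 1, 0, 0, 1]
f1, false_ratio = f1_and_false_ratio(truth, predicted)
print(f1, false_ratio)
```

Reporting both values matters for a client-side filter: F1 summarizes how well spam is caught, while the false ratio tracks the more costly error of hiding legitimate mail.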
Keywords/Search Tags: Spam, Feature Selection, Mutual Information, Classify, Bayesian